utf8::decode doesn't work under -T #7312
Comments
From stas@stason.org
Created by stas@rabbit.stason.org
The bug seems to be in utf8::decode (and probably related functions) when
% perl -MDevel::Peek -le '$_ = shift; Dump $_; utf8::decode $_; Dump $_' \
SV = PV(0x804df00) at 0x804dbd4
And now with -T:
% perl -T -MDevel::Peek -le '$_ = shift; Dump $_; utf8::decode $_; Dump $_' \
SV = PVMG(0x8087c10) at 0x804dbd4
As you can see the first invocation w/o -T correctly decodes input,
Perl Info
From @iabyn
On Mon, May 24, 2004 at 11:30:20PM -0000, Stas Bekman wrote:
(snip)
The immediate cause is the lack of a POK flag in the tainted version
Whether it's right for only the pPOK flag to be on the tainted string,
The RT System itself - Status changed from 'new' to 'open'
From stas@stason.org
Dave Mitchell wrote:
Thanks Dave,
So it's unclear whose fault it is, the code that populates sv with pv under -T
Also is it possible to feed utf8 chars directly from shell? i.e. replacing
From BQW10602@nifty.com
On Sun, 30 May 2004 21:43:38 -0700
You can use <$_ = Encode::decode('utf8', $_)>
Regards,
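A minimal sketch of the workaround suggested above, going through Encode rather than utf8::decode. The helper name `decode_octets` and the sample octets are assumptions made for illustration; they do not come from the thread.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from the thread): turn UTF-8 octets into a
# character string via Encode instead of calling utf8::decode in place.
sub decode_octets {
    my ($octets) = @_;
    # Encode::decode returns a new string; the caller assigns it back.
    return Encode::decode('utf8', $octets);
}

my $octets = "\xC4\x80";                 # UTF-8 encoding of U+0100
my $chars  = decode_octets($octets);

printf "octets: %d, characters: %d\n", length($octets), length($chars);
```

Unlike utf8::decode, this form returns the decoded string, which is why the suggestion assigns the result back to `$_`.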
From stas@stason.org
SADAHIRO Tomoyuki wrote:
Thank you, Sadahiro. It indeed fixes my simple test case (I haven't tested
perl-5.8.0 -T -MEncode -MDevel::Peek -le '$_ = shift; Dump $_; $_ =
Shouldn't sv_utf8_decode() (and potentially other utf8 functions) then be
From nick@ing-simmons.net
Stas Bekman <stas@stason.org> writes:
sv_utf8_decode is an internals thing that is exported to help out
Normally you have to do SvGETMAGIC() to get POK flag on, and I think
We could change that, but that won't help older perls.
From BQW10602@nifty.com
On Tue, 01 Jun 2004 08:50:33 +0100
I think Change 22652 fixes sv_pvutf8n_force (and SvPVutf8_force)
If I understand correctly, what the current sv_utf8_decode does is
I'm paying attention to SvPOK(sv) at the first conditional
If SvPOK(sv) in question is replaced by SvPOKp(sv),
Regards,
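The distinction between the public POK flag and the private pPOK flag discussed above can be inspected from Perl with the core B module. The following is only a sketch: the helper name `string_flags` is hypothetical, and the flags reported for a tainted scalar depend on the perl version, so they may not match what the thread observed on 5.8.x.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper: report the public (POK) and private (pPOK) string
# flags of a scalar, the same bits Devel::Peek prints in its FLAGS line.
sub string_flags {
    my ($ref) = @_;
    my $flags = B::svref_2object($ref)->FLAGS;
    return {
        POK  => ($flags & B::SVf_POK()) ? 1 : 0,
        pPOK => ($flags & B::SVp_POK()) ? 1 : 0,
    };
}

# $ENV{PATH} is tainted only when the script is run with -T.
my $value = defined $ENV{PATH} ? $ENV{PATH} : '';
my $f     = string_flags(\$value);

print "taint mode: ", (${^TAINT} ? "on" : "off"),
      ", POK=$f->{POK}, pPOK=$f->{pPOK}\n";
```

Running it once as `perl script.pl` and once as `perl -T script.pl` shows whether the two flags diverge on a given perl.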
From stas@stason.org
Nick Ing-Simmons wrote:
Now tested the application and Encode works a treat.
If it is exported for XS code, it shouldn't end up in the perl space, no? XS
If it remains in the perl space, please document that utf8::xxx is not to be
In this case I was trying to use it in perl code, so obviously I couldn't do
@rgs - Status changed from 'open' to 'resolved'
From stas@stason.org
The issue hasn't been resolved. utf8::decode still doesn't work under -T.
From BQW10602@nifty.com
Hello. Here is a patch:
Since sv_utf8_decode and sv_utf8_downgrade are marked with "may change"
Please note that sv_utf8_downgrade and sv_utf8_decode may fail
Inline Patch
diff -urN perl~/lib/utf8.pm perl/lib/utf8.pm
--- perl~/lib/utf8.pm Sun Apr 04 22:32:36 2004
+++ perl/lib/utf8.pm Sat Jun 05 23:33:34 2004
@@ -113,45 +113,59 @@
=item * $num_octets = utf8::upgrade($string)
-Converts (in-place) internal representation of string to Perl's
-internal I<UTF-X> form. Returns the number of octets necessary to
-represent the string as I<UTF-X>. Can be used to make sure that the
-UTF-8 flag is on, so that C<\w> or C<lc()> work as expected on strings
-containing characters in the range 0x80-0xFF (oon ASCII and
-derivatives). Note that this should not be used to convert a legacy
-byte encoding to Unicode: use Encode for that. Affected by the
-encoding pragma.
+Converts in-place the octet sequence in the native encoding
+(Latin-1 or EBCDIC) to the equivalent character sequence in I<UTF-X>.
+I<$string> already encoded as characters does no harm.
+Returns the number of octets necessary to represent the string as I<UTF-X>.
+Can be used to make sure that the UTF-8 flag is on,
+so that C<\w> or C<lc()> work as Unicode on strings
+containing characters in the range 0x80-0xFF (on ASCII and
+derivatives).
+B<Note that this function does not handle arbitrary encodings.>
+Therefore I<Encode.pm> is recommended for the general purposes.
+
+Affected by the encoding pragma.
+
=item * $success = utf8::downgrade($string[, FAIL_OK])
-Converts (in-place) internal representation of string to be un-encoded
-bytes. Returns true on success. On failure dies or, if the value of
-FAIL_OK is true, returns false. Can be used to make sure that the
-UTF-8 flag is off, e.g. when you want to make sure that the substr()
-or length() function works with the usually faster byte algorithm.
-Note that this should not be used to convert Unicode back to a legacy
-byte encoding: use Encode for that. B<Not> affected by the encoding
-pragma.
+Converts in-place the character sequence in I<UTF-X>
+to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC).
+I<$string> already encoded as octets does no harm.
+Returns true on success. On failure dies or, if the value of
+C<FAIL_OK> is true, returns false.
+Can be used to make sure that the UTF-8 flag is off,
+e.g. when you want to make sure that the substr() or length() function
+works with the usually faster byte algorithm.
+
+B<Note that this function does not handle arbitrary encodings.>
+Therefore I<Encode.pm> is recommended for the general purposes.
+
+B<Not> affected by the encoding pragma.
+B<NOTE:> this function is experimental and may change
+or be removed without notice.
+
=item * utf8::encode($string)
-Converts in-place the octets of the I<$string> to the octet sequence
-in Perl's I<UTF-X> encoding. Returns nothing. B<Note that this does
-not change the "type" of I<$string> to UTF-8>, and that this handles
-only ISO 8859-1 (or EBCDIC) as the source character set. Therefore
-this should not be used to convert a legacy 8-bit encoding to Unicode:
-use Encode::decode() for that. In the very limited case of wanting to
-handle just ISO 8859-1 (or EBCDIC), you could use utf8::upgrade().
+Converts in-place the character sequence to the corresponding octet sequence
+in I<UTF-X>. The UTF-8 flag is turned off. Returns nothing.
+
+B<Note that this function does not handle arbitrary encodings.>
+Therefore I<Encode.pm> is recommended for the general purposes.
=item * utf8::decode($string)
-Attempts to convert I<$string> in-place from Perl's I<UTF-X> encoding
-into octets. Returns nothing. B<Note that this does not change the
-"type" of <$string> from UTF-8>, and that this handles only ISO 8859-1
-(or EBCDIC) as the destination character set. Therefore this should
-not be used to convert Unicode back to a legacy 8-bit encoding:
-use Encode::encode() for that. In the very limited case of wanting
-to handle just ISO 8859-1 (or EBCDIC), you could use utf8::downgrade().
+Attempts to convert in-place the octet sequence in I<UTF-X>
+to the corresponding character sequence. The UTF-8 flag is turned on
+only if the source string contains multiple-byte I<UTF-X> characters.
+If I<$string> is invalid as I<UTF-X>, returns false; otherwise returns true.
+
+B<Note that this function does not handle arbitrary encodings.>
+Therefore I<Encode.pm> is recommended for the general purposes.
+
+B<NOTE:> this function is experimental and may change
+or be removed without notice.
=item * $flag = utf8::is_utf8(STRING)
diff -urN perl~/pod/perlapi.pod perl/pod/perlapi.pod
--- perl~/pod/perlapi.pod Thu Jun 03 02:08:58 2004
+++ perl/pod/perlapi.pod Sat Jun 05 22:44:34 2004
@@ -4834,9 +4834,11 @@
=item sv_utf8_decode
-Convert the octets in the PV from UTF-8 to chars. Scan for validity and then
-turn off SvUTF8 if needed so that we see characters. Used as a building block
-for decode_utf8 in Encode.xs
+If the PV of the SV is an octet sequence in UTF-8
+and contains a multiple-byte character, the C<SvUTF8> flag is turned on
+so that it looks like a character. If the PV contains only single-byte
+characters, the C<SvUTF8> flag stays being off.
+Scans PV for validity and returns false if the PV is invalid UTF-8.
NOTE: this function is experimental and may change or be
removed without notice.
@@ -4848,9 +4850,9 @@
=item sv_utf8_downgrade
-Attempt to convert the PV of an SV from UTF-8-encoded to byte encoding.
-This may not be possible if the PV contains non-byte encoding characters;
-if this is the case, either returns false or, if C<fail_ok> is not
+Attempts to convert the PV of an SV from characters to bytes.
+If the PV contains a character beyond byte, this conversion will fail;
+in this case, either returns false or, if C<fail_ok> is not
true, croaks.
This is not as a general purpose Unicode to byte encoding interface:
@@ -4866,9 +4868,8 @@
=item sv_utf8_encode
-Convert the PV of an SV to UTF-8-encoded, but then turn off the C<SvUTF8>
-flag so that it looks like octets again. Used as a building block
-for encode_utf8 in Encode.xs
+Converts the PV of an SV to UTF-8, but then turns the C<SvUTF8>
+flag off so that it looks like octets again.
void sv_utf8_encode(SV *sv)
@@ -4877,7 +4878,7 @@
=item sv_utf8_upgrade
-Convert the PV of an SV to its UTF-8-encoded form.
+Converts the PV of an SV to its UTF-8-encoded form.
Forces the SV to string form if it is not already.
Always sets the SvUTF8 flag to avoid future validity checks even
if all the bytes have hibit clear.
@@ -4892,7 +4893,7 @@
=item sv_utf8_upgrade_flags
-Convert the PV of an SV to its UTF-8-encoded form.
+Converts the PV of an SV to its UTF-8-encoded form.
Forces the SV to string form if it is not already.
Always sets the SvUTF8 flag to avoid future validity checks even
if all the bytes have hibit clear. If C<flags> has C<SV_GMAGIC> bit set,
diff -urN perl~/sv.c perl/sv.c
--- perl~/sv.c Tue May 25 01:18:32 2004
+++ perl/sv.c Sat Jun 05 22:43:56 2004
@@ -3907,7 +3907,7 @@
/*
=for apidoc sv_utf8_upgrade
-Convert the PV of an SV to its UTF-8-encoded form.
+Converts the PV of an SV to its UTF-8-encoded form.
Forces the SV to string form if it is not already.
Always sets the SvUTF8 flag to avoid future validity checks even
if all the bytes have hibit clear.
@@ -3917,7 +3917,7 @@
=for apidoc sv_utf8_upgrade_flags
-Convert the PV of an SV to its UTF-8-encoded form.
+Converts the PV of an SV to its UTF-8-encoded form.
Forces the SV to string form if it is not already.
Always sets the SvUTF8 flag to avoid future validity checks even
if all the bytes have hibit clear. If C<flags> has C<SV_GMAGIC> bit set,
@@ -3986,9 +3986,9 @@
/*
=for apidoc sv_utf8_downgrade
-Attempt to convert the PV of an SV from UTF-8-encoded to byte encoding.
-This may not be possible if the PV contains non-byte encoding characters;
-if this is the case, either returns false or, if C<fail_ok> is not
+Attempts to convert the PV of an SV from characters to bytes.
+If the PV contains a character beyond byte, this conversion will fail;
+in this case, either returns false or, if C<fail_ok> is not
true, croaks.
This is not as a general purpose Unicode to byte encoding interface:
@@ -4000,7 +4000,7 @@
bool
Perl_sv_utf8_downgrade(pTHX_ register SV* sv, bool fail_ok)
{
- if (SvPOK(sv) && SvUTF8(sv)) {
+ if (SvPOKp(sv) && SvUTF8(sv)) {
if (SvCUR(sv)) {
U8 *s;
STRLEN len;
@@ -4030,9 +4030,8 @@
/*
=for apidoc sv_utf8_encode
-Convert the PV of an SV to UTF-8-encoded, but then turn off the C<SvUTF8>
-flag so that it looks like octets again. Used as a building block
-for encode_utf8 in Encode.xs
+Converts the PV of an SV to UTF-8, but then turns the C<SvUTF8>
+flag off so that it looks like octets again.
=cut
*/
@@ -4053,9 +4052,11 @@
/*
=for apidoc sv_utf8_decode
-Convert the octets in the PV from UTF-8 to chars. Scan for validity and then
-turn off SvUTF8 if needed so that we see characters. Used as a building block
-for decode_utf8 in Encode.xs
+If the PV of the SV is an octet sequence in UTF-8
+and contains a multiple-byte character, the C<SvUTF8> flag is turned on
+so that it looks like a character. If the PV contains only single-byte
+characters, the C<SvUTF8> flag stays being off.
+Scans PV for validity and returns false if the PV is invalid UTF-8.
=cut
*/
@@ -4063,7 +4064,7 @@
bool
Perl_sv_utf8_decode(pTHX_ register SV *sv)
{
- if (SvPOK(sv)) {
+ if (SvPOKp(sv)) {
U8 *c;
U8 *e;
diff -urN perl~/t/op/utftaint.t perl/t/op/utftaint.t
--- perl~/t/op/utftaint.t Tue May 25 01:33:40 2004
+++ perl/t/op/utftaint.t Sat Jun 05 22:55:58 2004
@@ -23,12 +23,17 @@
use Scalar::Util qw(tainted);
use Test;
-plan tests => 3*10;
+plan tests => 3*10 + 3*8 + 2*16;
my $cnt = 0;
my $arg = $ENV{PATH}; # a tainted value
use constant UTF8 => "\x{1234}";
+sub is_utf8 {
+ my $s = shift;
+ return 0xB6 != ord pack('a*', chr(0xB6).$s);
+}
+
for my $ary ([ascii => 'perl'], [latin1 => "\xB6"], [utf8 => "\x{100}"]) {
my $encode = $ary->[0];
my $string = $ary->[1];
@@ -40,7 +45,7 @@
my $lconcat = $taint;
$lconcat .= UTF8;
- print $lconcat eq $string."\x{1234}"
+ print $lconcat eq $string.UTF8
? "ok " : "not ok ", ++$cnt, " # compare: $encode, concat left\n";
print tainted($lconcat) == tainted($arg)
@@ -48,7 +53,7 @@
my $rconcat = UTF8;
$rconcat .= $taint;
- print $rconcat eq "\x{1234}".$string
+ print $rconcat eq UTF8.$string
? "ok " : "not ok ", ++$cnt, " # compare: $encode, concat right\n";
print tainted($rconcat) == tainted($arg)
@@ -71,3 +76,111 @@
print tainted($taint) == tainted($arg)
? "ok " : "not ok ", ++$cnt, " # tainted: $encode, after test\n";
}
+
+
+for my $ary ([ascii => 'perl'], [latin1 => "\xB6"], [utf8 => "\x{100}"]) {
+ my $encode = $ary->[0];
+
+ my $utf8 = pack('U*') . $ary->[1];
+ my $byte = pack('C0a*', $utf8);
+
+ my $taint = $arg; substr($taint, 0) = $utf8;
+ utf8::encode($taint);
+
+ print $taint eq $byte
+ ? "ok " : "not ok ", ++$cnt, " # compare: $encode, encode utf8\n";
+
+ print pack('a*',$taint) eq pack('a*',$byte)
+ ? "ok " : "not ok ", ++$cnt, " # bytecmp: $encode, encode utf8\n";
+
+ print !is_utf8($taint)
+ ? "ok " : "not ok ", ++$cnt, " # is_utf8: $encode, encode utf8\n";
+
+ print tainted($taint) == tainted($arg)
+ ? "ok " : "not ok ", ++$cnt, " # tainted: $encode, encode utf8\n";
+
+ my $taint = $arg; substr($taint, 0) = $byte;
+ utf8::decode($taint);
+
+ print $taint eq $utf8
+ ? "ok " : "not ok ", ++$cnt, " # compare: $encode, decode byte\n";
+
+ print pack('a*',$taint) eq pack('a*',$utf8)
+ ? "ok " : "not ok ", ++$cnt, " # bytecmp: $encode, decode byte\n";
+
+ print is_utf8($taint) eq ($encode ne 'ascii')
+ ? "ok " : "not ok ", ++$cnt, " # is_utf8: $encode, decode byte\n";
+
+ print tainted($taint) == tainted($arg)
+ ? "ok " : "not ok ", ++$cnt, " # tainted: $encode, decode byte\n";
+}
+
+
+for my $ary ([ascii => 'perl'], [latin1 => "\xB6"]) {
+ my $encode = $ary->[0];
+
+ my $up = pack('U*') . $ary->[1];
+ my $down = pack('C0a*', $ary->[1]);
+
+ my $taint = $arg; substr($taint, 0) = $up;
+ utf8::upgrade($taint);
+
+ print $taint eq $up
+ ? "ok " : "not ok ", ++$cnt, " # compare: $encode, upgrade up\n";
+
+ print pack('a*',$taint) eq pack('a*',$up)
+ ? "ok " : "not ok ", ++$cnt, " # bytecmp: $encode, upgrade up\n";
+
+ print is_utf8($taint)
+ ? "ok " : "not ok ", ++$cnt, " # is_utf8: $encode, upgrade up\n";
+
+ print tainted($taint) == tainted($arg)
+ ? "ok " : "not ok ", ++$cnt, " # tainted: $encode, upgrade up\n";
+
+ my $taint = $arg; substr($taint, 0) = $down;
+ utf8::upgrade($taint);
+
+ print $taint eq $up
+ ? "ok " : "not ok ", ++$cnt, " # compare: $encode, upgrade down\n";
+
+ print pack('a*',$taint) eq pack('a*',$up)
+ ? "ok " : "not ok ", ++$cnt, " # bytecmp: $encode, upgrade down\n";
+
+ print is_utf8($taint)
+ ? "ok " : "not ok ", ++$cnt, " # is_utf8: $encode, upgrade down\n";
+
+ print tainted($taint) == tainted($arg)
+ ? "ok " : "not ok ", ++$cnt, " # tainted: $encode, upgrade down\n";
+
+ my $taint = $arg; substr($taint, 0) = $up;
+ utf8::downgrade($taint);
+
+ print $taint eq $down
+ ? "ok " : "not ok ", ++$cnt, " # compare: $encode, downgrade up\n";
+
+ print pack('a*',$taint) eq pack('a*',$down)
+ ? "ok " : "not ok ", ++$cnt, " # bytecmp: $encode, downgrade up\n";
+
+ print !is_utf8($taint)
+ ? "ok " : "not ok ", ++$cnt, " # is_utf8: $encode, downgrade up\n";
+
+ print tainted($taint) == tainted($arg)
+ ? "ok " : "not ok ", ++$cnt, " # tainted: $encode, downgrade up\n";
+
+ my $taint = $arg; substr($taint, 0) = $down;
+ utf8::downgrade($taint);
+
+ print $taint eq $down
+ ? "ok " : "not ok ", ++$cnt, " # compare: $encode, downgrade down\n";
+
+ print pack('a*',$taint) eq pack('a*',$down)
+ ? "ok " : "not ok ", ++$cnt, " # bytecmp: $encode, downgrade down\n";
+
+ print !is_utf8($taint)
+ ? "ok " : "not ok ", ++$cnt, " # is_utf8: $encode, downgrade down\n";
+
+ print tainted($taint) == tainted($arg)
+ ? "ok " : "not ok ", ++$cnt, " # tainted: $encode, downgrade down\n";
+}
+
+
Regards,
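As an illustration of the four functions described in the patched utf8.pm documentation, here is a small sketch on untainted strings. The sample values are assumptions chosen for the example, and the noted results assume an ASCII-based platform.

```perl
use strict;
use warnings;

# upgrade/downgrade change only the internal storage of the same characters.
my $s = "\xB6";                          # one Latin-1 character, one octet
utf8::upgrade($s);
my $flag_after_upgrade = utf8::is_utf8($s) ? 1 : 0;     # flag on
utf8::downgrade($s);
my $flag_after_downgrade = utf8::is_utf8($s) ? 1 : 0;   # flag off

# encode/decode convert between a character sequence and its UTF-8 octets.
my $t = "\x{100}";                       # one character above 0xFF
utf8::encode($t);                        # now the two octets C4 80
my $len_encoded = length $t;
my $decoded_ok  = utf8::decode($t);      # true when the octets are valid UTF-8
my $len_decoded = length $t;

print "upgrade: flag=$flag_after_upgrade, downgrade: flag=$flag_after_downgrade\n";
print "encoded length=$len_encoded, decoded length=$len_decoded\n";
```

The length of `$s` stays 1 throughout, while `$t` goes from one character to two octets and back.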
From @rgs
SADAHIRO Tomoyuki wrote:
Thanks, applied as change #22902.
From stas@stason.org
Rafael Garcia-Suarez wrote:
It works for me. Thanks a lot, SADAHIRO!
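A quick way to check the behaviour on a given perl is a script along these lines, run once normally and once with `perl -T`. It is a sketch only: it borrows the substr trick from t/op/utftaint.t in the patch to keep any taint on the scalar, and the sample octets are an assumption for illustration.

```perl
use strict;
use warnings;

# Start from a value that is tainted under -T, then overwrite its contents
# with known octets; assigning through substr keeps the taint on the scalar.
my $octets = defined $ENV{PATH} ? $ENV{PATH} : '';
substr($octets, 0) = "\xC4\x80";         # UTF-8 octets for U+0100

my $ok = utf8::decode($octets);          # in place; true if valid UTF-8

print "taint mode: ", (${^TAINT} ? "on" : "off"), "\n";
print "decode returned: ", ($ok ? "true" : "false"), "\n";
print "is_utf8: ", (utf8::is_utf8($octets) ? 1 : 0), "\n";
print "length: ", length($octets), "\n";
```

On a perl that includes the fix, the last three lines should be the same with and without -T.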
Migrated from rt.perl.org#29841 (status was 'resolved')
Searchable as RT29841$