New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
utf8 interfering where it didn't ought to #1292
Comments
From mjtg@cus.cam.ac.uk%perl5.5.670 -wde 1 Loading DB routines from perl5db.pl version 1.05 Enter h or `h h' for help, run `perldoc perldebug' for more help. main::(-e:1): 1 DB<2> x $x As you might guess, it doesn't do that under 5.005_03 Mike Guy % perl5.5.670 -V Characteristics of this binary (from libperl): |
From [Unknown Contact. See original ticket]I've been having a somewhat frustrating time trying to follow this First, here's a simpler related example (under perl5.6.rc1 rather DB<1> x "\xff\xff\xff\0" Note we get the Malformed message as well as the original "SWASHGET". Also the argument lists in the backtrace are somewhat mangled - see The above backtrace was produced on Saturday. I next tried to close I'd originally discovered the bug while running one of my production Sigh. Just tried building a clean rc1 - and the bug shows again! Mike Guy |
From @gsarOn Mon, 13 Mar 2000 07:12:10 GMT, "M.J.T. Guy" wrote:
This shows up because the debugger goes and does: UNIVERSAL::isa("\xff\xff\xff\0", "HASH") which boils down to calling: gv_fetchpv("\xff\xff\xff\0", FALSE) But gv_fetchpv() and friends now expect a well-formed utf8 string--hence This ought to fix it. Sarathy Inline Patch-----------------------------------8<-----------------------------------
Change 5700 by gsar@auger on 2000/03/13 09:57:59
make the is_utf8_*() safe for use on invalid utf8 (they now
return false on such input instead of emitting warnings)
Affected files ...
... //depot/perl/embed.h#168 edit
... //depot/perl/embed.pl#119 edit
... //depot/perl/global.sym#132 edit
... //depot/perl/gv.c#96 edit
... //depot/perl/objXSUB.h#108 edit
... //depot/perl/perlapi.c#51 edit
... //depot/perl/pod/perlapi.pod#6 edit
... //depot/perl/proto.h#203 edit
... //depot/perl/universal.c#27 edit
... //depot/perl/utf8.c#21 edit
Differences ...
==== //depot/perl/embed.h#168 (text+w) ====
Index: perl/embed.h
--- perl/embed.h.~1~ Mon Mar 13 01:58:04 2000
+++ perl/embed.h Mon Mar 13 01:58:04 2000
@@ -300,6 +300,7 @@
#define to_uni_upper_lc Perl_to_uni_upper_lc
#define to_uni_title_lc Perl_to_uni_title_lc
#define to_uni_lower_lc Perl_to_uni_lower_lc
+#define is_utf8_char Perl_is_utf8_char
#define is_utf8_alnum Perl_is_utf8_alnum
#define is_utf8_alnumc Perl_is_utf8_alnumc
#define is_utf8_idfirst Perl_is_utf8_idfirst
@@ -1744,6 +1745,7 @@
#define to_uni_upper_lc(a) Perl_to_uni_upper_lc(aTHX_ a)
#define to_uni_title_lc(a) Perl_to_uni_title_lc(aTHX_ a)
#define to_uni_lower_lc(a) Perl_to_uni_lower_lc(aTHX_ a)
+#define is_utf8_char(a) Perl_is_utf8_char(aTHX_ a)
#define is_utf8_alnum(a) Perl_is_utf8_alnum(aTHX_ a)
#define is_utf8_alnumc(a) Perl_is_utf8_alnumc(aTHX_ a)
#define is_utf8_idfirst(a) Perl_is_utf8_idfirst(aTHX_ a)
@@ -3420,6 +3422,8 @@
#define to_uni_title_lc Perl_to_uni_title_lc
#define Perl_to_uni_lower_lc CPerlObj::Perl_to_uni_lower_lc
#define to_uni_lower_lc Perl_to_uni_lower_lc
+#define Perl_is_utf8_char CPerlObj::Perl_is_utf8_char
+#define is_utf8_char Perl_is_utf8_char
#define Perl_is_utf8_alnum CPerlObj::Perl_is_utf8_alnum
#define is_utf8_alnum Perl_is_utf8_alnum
#define Perl_is_utf8_alnumc CPerlObj::Perl_is_utf8_alnumc
==== //depot/perl/embed.pl#119 (xtext) ====
Index: perl/embed.pl
--- perl/embed.pl.~1~ Mon Mar 13 01:58:04 2000
+++ perl/embed.pl Mon Mar 13 01:58:04 2000
@@ -1597,6 +1597,7 @@
Ap |U32 |to_uni_upper_lc|U32 c
Ap |U32 |to_uni_title_lc|U32 c
Ap |U32 |to_uni_lower_lc|U32 c
+Ap |int |is_utf8_char |U8 *p
Ap |bool |is_utf8_alnum |U8 *p
Ap |bool |is_utf8_alnumc |U8 *p
Ap |bool |is_utf8_idfirst|U8 *p
==== //depot/perl/global.sym#132 (text+w) ====
Index: perl/global.sym
--- perl/global.sym.~1~ Mon Mar 13 01:58:04 2000
+++ perl/global.sym Mon Mar 13 01:58:04 2000
@@ -180,6 +180,7 @@
Perl_to_uni_upper_lc
Perl_to_uni_title_lc
Perl_to_uni_lower_lc
+Perl_is_utf8_char
Perl_is_utf8_alnum
Perl_is_utf8_alnumc
Perl_is_utf8_idfirst
==== //depot/perl/gv.c#96 (text) ====
Index: perl/gv.c
--- perl/gv.c.~1~ Mon Mar 13 01:58:04 2000
+++ perl/gv.c Mon Mar 13 01:58:04 2000
@@ -448,10 +448,10 @@
/*
=for apidoc gv_stashpv
-Returns a pointer to the stash for a specified package. If C<create> is
-set then the package will be created if it does not already exist. If
-C<create> is not set and the package does not exist then NULL is
-returned.
+Returns a pointer to the stash for a specified package. C<name> should
+be a valid UTF-8 string. If C<create> is set then the package will be
+created if it does not already exist. If C<create> is not set and the
+package does not exist then NULL is returned.
=cut
*/
@@ -494,8 +494,8 @@
/*
=for apidoc gv_stashsv
-Returns a pointer to the stash for a specified package. See
-C<gv_stashpv>.
+Returns a pointer to the stash for a specified package, which must be a
+valid UTF-8 string. See C<gv_stashpv>.
=cut
*/
==== //depot/perl/objXSUB.h#108 (text+w) ====
Index: perl/objXSUB.h
--- perl/objXSUB.h.~1~ Mon Mar 13 01:58:04 2000
+++ perl/objXSUB.h Mon Mar 13 01:58:04 2000
@@ -687,6 +687,10 @@
#define Perl_to_uni_lower_lc pPerl->Perl_to_uni_lower_lc
#undef to_uni_lower_lc
#define to_uni_lower_lc Perl_to_uni_lower_lc
+#undef Perl_is_utf8_char
+#define Perl_is_utf8_char pPerl->Perl_is_utf8_char
+#undef is_utf8_char
+#define is_utf8_char Perl_is_utf8_char
#undef Perl_is_utf8_alnum
#define Perl_is_utf8_alnum pPerl->Perl_is_utf8_alnum
#undef is_utf8_alnum
==== //depot/perl/perlapi.c#51 (text+w) ====
Index: perl/perlapi.c
--- perl/perlapi.c.~1~ Mon Mar 13 01:58:04 2000
+++ perl/perlapi.c Mon Mar 13 01:58:04 2000
@@ -1288,6 +1288,13 @@
return ((CPerlObj*)pPerl)->Perl_to_uni_lower_lc(c);
}
+#undef Perl_is_utf8_char
+int
+Perl_is_utf8_char(pTHXo_ U8 *p)
+{
+ return ((CPerlObj*)pPerl)->Perl_is_utf8_char(p);
+}
+
#undef Perl_is_utf8_alnum
bool
Perl_is_utf8_alnum(pTHXo_ U8 *p)
==== //depot/perl/pod/perlapi.pod#6 (text+w) ====
Index: perl/pod/perlapi.pod
--- perl/pod/perlapi.pod.~1~ Mon Mar 13 01:58:04 2000
+++ perl/pod/perlapi.pod Mon Mar 13 01:58:04 2000
@@ -381,17 +381,17 @@
=item gv_stashpv
-Returns a pointer to the stash for a specified package. If C<create> is
-set then the package will be created if it does not already exist. If
-C<create> is not set and the package does not exist then NULL is
-returned.
+Returns a pointer to the stash for a specified package. C<name> should
+be a valid UTF-8 string. If C<create> is set then the package will be
+created if it does not already exist. If C<create> is not set and the
+package does not exist then NULL is returned.
HV* gv_stashpv(const char* name, I32 create)
=item gv_stashsv
-Returns a pointer to the stash for a specified package. See
-C<gv_stashpv>.
+Returns a pointer to the stash for a specified package, which must be a
+valid UTF-8 string. See C<gv_stashpv>.
HV* gv_stashsv(SV* sv, I32 create)
==== //depot/perl/proto.h#203 (text+w) ====
Index: perl/proto.h
--- perl/proto.h.~1~ Mon Mar 13 01:58:04 2000
+++ perl/proto.h Mon Mar 13 01:58:04 2000
@@ -365,6 +365,7 @@
PERL_CALLCONV U32 Perl_to_uni_upper_lc(pTHX_ U32 c);
PERL_CALLCONV U32 Perl_to_uni_title_lc(pTHX_ U32 c);
PERL_CALLCONV U32 Perl_to_uni_lower_lc(pTHX_ U32 c);
+PERL_CALLCONV int Perl_is_utf8_char(pTHX_ U8 *p);
PERL_CALLCONV bool Perl_is_utf8_alnum(pTHX_ U8 *p);
PERL_CALLCONV bool Perl_is_utf8_alnumc(pTHX_ U8 *p);
PERL_CALLCONV bool Perl_is_utf8_idfirst(pTHX_ U8 *p);
==== //depot/perl/universal.c#27 (text) ====
==== //depot/perl/utf8.c#21 (text) ====
Index: perl/utf8.c
--- perl/utf8.c.~1~ Mon Mar 13 01:58:04 2000
+++ perl/utf8.c Mon Mar 13 01:58:04 2000
@@ -101,6 +101,39 @@
#endif
}
+/* Tests if some arbitrary number of bytes begins in a valid UTF-8 character.
+ * The actual number of bytes in the UTF-8 character will be returned if it
+ * is valid, otherwise 0. */
+int
+Perl_is_utf8_char(pTHX_ U8 *s)
+{
+ U8 u = *s;
+ int slen, len;
+
+ if (!(u & 0x80))
+ return 1;
+
+ if (!(u & 0x40))
+ return 0;
+
+ if (!(u & 0x20)) { len = 2; }
+ else if (!(u & 0x10)) { len = 3; }
+ else if (!(u & 0x08)) { len = 4; }
+ else if (!(u & 0x04)) { len = 5; }
+ else if (!(u & 0x02)) { len = 6; }
+ else if (!(u & 0x01)) { len = 7; }
+ else { len = 13; } /* whoa! */
+
+ slen = len - 1;
+ s++;
+ while (slen--) {
+ if ((*s & 0xc0) != 0x80)
+ return 0;
+ s++;
+ }
+ return len;
+}
+
UV
Perl_utf8_to_uv(pTHX_ U8* s, I32* retlen)
{
@@ -500,6 +533,8 @@
bool
Perl_is_utf8_alnum(pTHX_ U8 *p)
{
+ if (!is_utf8_char(p))
+ return FALSE;
if (!PL_utf8_alnum)
PL_utf8_alnum = swash_init("utf8", "IsAlnum", &PL_sv_undef, 0, 0);
return swash_fetch(PL_utf8_alnum, p);
@@ -515,6 +550,8 @@
bool
Perl_is_utf8_alnumc(pTHX_ U8 *p)
{
+ if (!is_utf8_char(p))
+ return FALSE;
if (!PL_utf8_alnum)
PL_utf8_alnum = swash_init("utf8", "IsAlnumC", &PL_sv_undef, 0, 0);
return swash_fetch(PL_utf8_alnum, p);
@@ -536,6 +573,8 @@
bool
Perl_is_utf8_alpha(pTHX_ U8 *p)
{
+ if (!is_utf8_char(p))
+ return FALSE;
if (!PL_utf8_alpha)
PL_utf8_alpha = swash_init("utf8", "IsAlpha", &PL_sv_undef, 0, 0);
return swash_fetch(PL_utf8_alpha, p);
@@ -544,6 +583,8 @@
bool
Perl_is_utf8_ascii(pTHX_ U8 *p)
{
+ if (!is_utf8_char(p))
+ return FALSE;
if (!PL_utf8_ascii)
PL_utf8_ascii = swash_init("utf8", "IsAscii", &PL_sv_undef, 0, 0);
return swash_fetch(PL_utf8_ascii, p);
@@ -552,6 +593,8 @@
bool
Perl_is_utf8_space(pTHX_ U8 *p)
{
+ if (!is_utf8_char(p))
+ return FALSE;
if (!PL_utf8_space)
PL_utf8_space = swash_init("utf8", "IsSpace", &PL_sv_undef, 0, 0);
return swash_fetch(PL_utf8_space, p);
@@ -560,6 +603,8 @@
bool
Perl_is_utf8_digit(pTHX_ U8 *p)
{
+ if (!is_utf8_char(p))
+ return FALSE;
if (!PL_utf8_digit)
PL_utf8_digit = swash_init("utf8", "IsDigit", &PL_sv_undef, 0, 0);
return swash_fetch(PL_utf8_digit, p);
@@ -568,6 +613,8 @@
bool
Perl_is_utf8_upper(pTHX_ U8 *p)
{
+ if (!is_utf8_char(p))
+ return FALSE;
if (!PL_utf8_upper)
PL_utf8_upper = swash_init("utf8", "IsUpper", &PL_sv_undef, 0, 0);
return swash_fetch(PL_utf8_upper, p);
@@ -576,6 +623,8 @@
bool
Perl_is_utf8_lower(pTHX_ U8 *p)
{
+ if (!is_utf8_char(p))
+ return FALSE;
if (!PL_utf8_lower)
PL_utf8_lower = swash_init("utf8", "IsLower", &PL_sv_undef, 0, 0);
return swash_fetch(PL_utf8_lower, p);
@@ -584,6 +633,8 @@
bool
Perl_is_utf8_cntrl(pTHX_ U8 *p)
{
+ if (!is_utf8_char(p))
+ return FALSE;
if (!PL_utf8_cntrl)
PL_utf8_cntrl = swash_init("utf8", "IsCntrl", &PL_sv_undef, 0, 0);
return swash_fetch(PL_utf8_cntrl, p);
@@ -592,6 +643,8 @@
bool
Perl_is_utf8_graph(pTHX_ U8 *p)
{
+ if (!is_utf8_char(p))
+ return FALSE;
if (!PL_utf8_graph)
PL_utf8_graph = swash_init("utf8", "IsGraph", &PL_sv_undef, 0, 0);
return swash_fetch(PL_utf8_graph, p);
@@ -600,6 +653,8 @@
bool
Perl_is_utf8_print(pTHX_ U8 *p)
{
+ if (!is_utf8_char(p))
+ return FALSE;
if (!PL_utf8_print)
PL_utf8_print = swash_init("utf8", "IsPrint", &PL_sv_undef, 0, 0);
return swash_fetch(PL_utf8_print, p);
@@ -608,6 +663,8 @@
bool
Perl_is_utf8_punct(pTHX_ U8 *p)
{
+ if (!is_utf8_char(p))
+ return FALSE;
if (!PL_utf8_punct)
PL_utf8_punct = swash_init("utf8", "IsPunct", &PL_sv_undef, 0, 0);
return swash_fetch(PL_utf8_punct, p);
@@ -616,6 +673,8 @@
bool
Perl_is_utf8_xdigit(pTHX_ U8 *p)
{
+ if (!is_utf8_char(p))
+ return FALSE;
if (!PL_utf8_xdigit)
PL_utf8_xdigit = swash_init("utf8", "IsXDigit", &PL_sv_undef, 0, 0);
return swash_fetch(PL_utf8_xdigit, p);
@@ -624,6 +683,8 @@
bool
Perl_is_utf8_mark(pTHX_ U8 *p)
{
+ if (!is_utf8_char(p))
+ return FALSE;
if (!PL_utf8_mark)
PL_utf8_mark = swash_init("utf8", "IsM", &PL_sv_undef, 0, 0);
return swash_fetch(PL_utf8_mark, p);
End of Patch. |
From [Unknown Contact. See original ticket]Gurusamy Sarathy <gsar@ActiveState.com> wrote
Thanks. That's OK now. And in return, here's a patch to the regression tests. Mike Guy Inline Patch--- ./t/op/universal.t.orig Fri Mar 3 15:36:55 2000
+++ ./t/op/universal.t Mon Mar 13 10:43:26 2000
@@ -102,3 +102,5 @@
test ! UNIVERSAL::can($b, "can");
test ! $a->can("export_tags"); # a method in Exporter
+
+test ! UNIVERSAL::isa("\xff\xff\xff\0", 'HASH');
End of patch |
From [Unknown Contact. See original ticket]I wrote
Rats. It'd be better if it was a correct patch. Honest, I did test Try to get it rihgt this time. Mike Guy Inline Patch--- ./t/op/universal.t.orig Tue Mar 14 13:47:48 2000
+++ ./t/op/universal.t Tue Mar 14 13:48:24 2000
@@ -8,7 +8,7 @@
unshift @INC, '../lib' if -d '../lib';
}
-print "1..72\n";
+print "1..73\n";
$a = {};
bless $a, "Bob";
@@ -102,3 +102,5 @@
test ! UNIVERSAL::can($b, "can");
test ! $a->can("export_tags"); # a method in Exporter
+
+test ! UNIVERSAL::isa("\xff\xff\xff\0", 'HASH');
End of patch |
Migrated from rt.perl.org#2289 (status was 'resolved')
Searchable as RT2289$
The text was updated successfully, but these errors were encountered: