utf8::valid considers illegal characters to be valid #8944
From jgmyers@proofpoint.com
Created by jgmyers@pong.us.proofpoint.com
This bug is similar to bug #38722. utf8::valid() and utf8::decode()
The following patch tightens up the validity checks to exclude such
This also brings up the separate issue that the "chr" function should
Inline Patch
--- perl-5.8.8-attrib/utf8.h 2006-06-26 15:34:05.000000000 -0700
+++ perl-5.8.8-utf8valid/utf8.h 2007-06-22 14:18:26.000000000 -0700
@@ -276,15 +276,13 @@
(p)[2] >= 0x80 && (p)[2] <= 0xBF)
#define IS_UTF8_CHAR_3c(p) \
((p)[0] == 0xED && \
- (p)[1] >= 0x80 && (p)[1] <= 0xBF && \
- (p)[2] >= 0x80 && (p)[2] <= 0xBF)
-/* In IS_UTF8_CHAR_3c(p) one could use
- * (p)[1] >= 0x80 && (p)[1] <= 0x9F
- * if one wanted to exclude surrogates. */
+ (p)[1] >= 0x80 && (p)[1] <= 0x9F)
#define IS_UTF8_CHAR_3d(p) \
((p)[0] >= 0xEE && (p)[0] <= 0xEF && \
(p)[1] >= 0x80 && (p)[1] <= 0xBF && \
- (p)[2] >= 0x80 && (p)[2] <= 0xBF)
+ (p)[2] >= 0x80 && (p)[2] <= 0xBF && \
+ ((p)[0] != 0xEF || (((p)[1] != 0xBF || (p)[2] <= 0xBD) && \
+ ((p)[1] != 0xB7 || (p)[2] <= 0x8F || (p)[2] >= 0xB0))))
#define IS_UTF8_CHAR_4a(p) \
((p)[0] == 0xF0 && \
(p)[1] >= 0x90 && (p)[1] <= 0xBF && \
@@ -315,9 +313,9 @@
IS_UTF8_CHAR_3c(p) || \
IS_UTF8_CHAR_3d(p))
#define IS_UTF8_CHAR_4(p) \
- (IS_UTF8_CHAR_4a(p) || \
- IS_UTF8_CHAR_4b(p) || \
- IS_UTF8_CHAR_4c(p))
+ ((IS_UTF8_CHAR_4a(p) || \
+ IS_UTF8_CHAR_4b(p) || \
+ IS_UTF8_CHAR_4c(p)) && ((p)[2] != 0xBF || (p)[3] <= 0xBD ||
/* IS_UTF8_CHAR(p) is strictly speaking wrong (not UTF-8) because it
Perl Info
|
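The byte ranges in the patch above map onto specific code points: the surrogates (U+D800..U+DFFF), the noncharacters U+FDD0..U+FDEF, and U+FFFE/U+FFFF. As a rough illustration only (this is neither the submitted patch nor Perl's implementation, and all function names are hypothetical), the same three-byte checks can be expressed by decoding first and then testing the code point:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of utf8.h: decode a 3-byte UTF-8 sequence.
 * Returns the code point, or -1 if the bytes are not a shortest-form
 * 3-byte sequence (bad lead/continuation byte, or value below U+0800). */
static int32_t decode_utf8_3(const uint8_t *p)
{
    assert(p != NULL);
    if ((p[0] & 0xF0) != 0xE0 || (p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80)
        return -1;
    int32_t cp = ((int32_t)(p[0] & 0x0F) << 12)
               | ((int32_t)(p[1] & 0x3F) << 6)
               |  (int32_t)(p[2] & 0x3F);
    return cp < 0x800 ? -1 : cp;
}

/* UTF-16 surrogate range; these have no well-formed UTF-8 encoding. */
static bool is_surrogate(int32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Unicode noncharacters: U+FDD0..U+FDEF and the last two code points
 * of every plane (U+xxFFFE, U+xxFFFF). */
static bool is_noncharacter(int32_t cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

/* Strict check for a 3-byte sequence: well-formed, and neither a
 * surrogate nor a noncharacter. */
static bool is_strict_utf8_3(const uint8_t *p)
{
    int32_t cp = decode_utf8_3(p);
    return cp >= 0 && !is_surrogate(cp) && !is_noncharacter(cp);
}
```

With these definitions, ED A0 80 (U+D800) and EF B7 90 (U+FDD0) are rejected, while ED 9F BF (U+D7FF) and EF BF BD (U+FFFD) are accepted, which matches the boundaries used in the patched IS_UTF8_CHAR_3c and IS_UTF8_CHAR_3d macros.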
From james@mastros.biz
On Fri, Jun 22, 2007 at 02:31:52PM -0700, John Gardiner Myers wrote:
This sounds like a feature, not a bug. Perl uses utf8 to mean a
BTW, Java-based applications may store characters above U+FFFF as utf8
Just my 1p, |
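The remark about Java is cut off in this copy. Assuming it refers to the practice of encoding each half of a UTF-16 surrogate pair as its own 3-byte sequence (as in CESU-8 or Java's modified UTF-8), a lax decoder has to recombine the two halves into one code point. A minimal C sketch with hypothetical function names, not taken from Perl or Java:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one 3-byte sequence WITHOUT rejecting
 * surrogates (deliberately lax). Returns -1 on a malformed sequence. */
static int32_t decode3_lax(const uint8_t *p)
{
    assert(p != NULL);
    if ((p[0] & 0xF0) != 0xE0 || (p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80)
        return -1;
    return ((int32_t)(p[0] & 0x0F) << 12)
         | ((int32_t)(p[1] & 0x3F) << 6)
         |  (int32_t)(p[2] & 0x3F);
}

/* Hypothetical helper: interpret six bytes as a high surrogate followed by
 * a low surrogate, each encoded as a 3-byte sequence, and combine them into
 * a supplementary code point. Returns -1 if the bytes are not such a pair. */
static int32_t decode_surrogate_pair6(const uint8_t *p)
{
    int32_t hi = decode3_lax(p);
    int32_t lo = decode3_lax(p + 3);
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return -1;
    /* Standard UTF-16 surrogate arithmetic. */
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}
```

For example, the six bytes ED A0 80 ED B0 80 combine to U+10000. A strict UTF-8 validator would reject both halves as surrogates, which is where the lax and strict readings of "valid" diverge.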
The RT System itself - Status changed from 'new' to 'open' |
From jgmyers@proofpoint.com
As of Unicode 3.1, implementations are prohibited from interpreting
This is a security requirement, otherwise an attacker could get |
From @jkeenan
On Fri Jun 22 14:31:51 2007, jgmyers wrote:
Discussion in this RT petered out five years ago. Is there anyone
Thank you very much. |
From @cpansprout
On Wed Sep 18 16:32:08 2013, jkeenan wrote:
Yes. It’s a bit confusing. Perl strings can contain characters that
The OP here seems to have misunderstood the function’s purpose, and
So I think we can reject it as not-a-bug. But perhaps the documentation
--
Father Chrysostomos |
From @ikegami
On Wed, Sep 18, 2013 at 9:05 PM, Father Chrysostomos via RT <
The OP here seems to have misunderstood the function’s purpose, and
The ticket also mentioned utf8::decode. The docs for utf8::decode claim it |
From @jkeenan
On Wed Sep 18 18:05:55 2013, sprout wrote:
[snip]
The documentation for utf8::valid has, in fact, been updated a couple of
######
I think this paragraph is satisfactory as is because (a) it suggests
Since the people who have most recently modified this paragraph are khw
Thank you very much. |
From @jkeenan
On Sat Oct 19 19:08:12 2013, jkeenan wrote:
No objection heard. Closing ticket and turning my attention to the Toronto.pm meeting, now in progress! |
@jkeenan - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#43294 (status was 'resolved')
Searchable as RT43294$