utf8::valid considers illegal characters to be valid #8944

Closed
p5pRT opened this issue Jun 22, 2007 · 10 comments


p5pRT commented Jun 22, 2007

Migrated from rt.perl.org#43294 (status was 'resolved')

Searchable as RT43294$


p5pRT commented Jun 22, 2007

From jgmyers@proofpoint.com

Created by jgmyers@pong.us.proofpoint.com

This bug is similar to bug #38722. utf8::valid() and utf8::decode()
incorrectly consider illegal characters and surrogates as being valid.
A script that depends on using these functions to validate untrusted
input will then have the resulting invalid unicode strings throw
warnings out of Perl_uvuni_to_utf8_flags in later processing.

The following patch tightens up the validity checks to exclude such
illegal and ill-formed characters. Applying it causes a couple of
perl's harness tests to fail as those tests incorrectly expect to be
able to process surrogates and illegal characters.

This also brings up the separate issue that the "chr" function should
probably throw a warning when asked to create a character that
Perl_uvuni_to_utf8_flags would warn about.

Inline Patch
--- perl-5.8.8-attrib/utf8.h    2006-06-26 15:34:05.000000000 -0700
+++ perl-5.8.8-utf8valid/utf8.h 2007-06-22 14:18:26.000000000 -0700
@@ -276,15 +276,13 @@
         (p)[2] >= 0x80 && (p)[2] <= 0xBF)
 #define IS_UTF8_CHAR_3c(p)     \
        ((p)[0] == 0xED && \
-        (p)[1] >= 0x80 && (p)[1] <= 0xBF && \
-        (p)[2] >= 0x80 && (p)[2] <= 0xBF)
-/* In IS_UTF8_CHAR_3c(p) one could use
- * (p)[1] >= 0x80 && (p)[1] <= 0x9F
- * if one wanted to exclude surrogates. */
+        (p)[1] >= 0x80 && (p)[1] <= 0x9F)
 #define IS_UTF8_CHAR_3d(p)     \
        ((p)[0] >= 0xEE && (p)[0] <= 0xEF && \
         (p)[1] >= 0x80 && (p)[1] <= 0xBF && \
-        (p)[2] >= 0x80 && (p)[2] <= 0xBF)
+        (p)[2] >= 0x80 && (p)[2] <= 0xBF && \
+        ((p)[0] != 0xEF || (((p)[1] != 0xBF || (p)[2] <= 0xBD) && \
+                            ((p)[1] != 0xB7 || (p)[2] <= 0x8F || (p)[2] >= 0xB0))))
 #define IS_UTF8_CHAR_4a(p)     \
        ((p)[0] == 0xF0 && \
         (p)[1] >= 0x90 && (p)[1] <= 0xBF && \
@@ -315,9 +313,9 @@
         IS_UTF8_CHAR_3c(p) || \
         IS_UTF8_CHAR_3d(p))
 #define IS_UTF8_CHAR_4(p)      \
-       (IS_UTF8_CHAR_4a(p) || \
-        IS_UTF8_CHAR_4b(p) || \
-        IS_UTF8_CHAR_4c(p))
+       ((IS_UTF8_CHAR_4a(p) || \
+         IS_UTF8_CHAR_4b(p) || \
+         IS_UTF8_CHAR_4c(p)) && ((p)[2] != 0xBF || (p)[3] <= 0xBD || ((p)[1] & 0xf) != 0xf))

/* IS_UTF8_CHAR(p) is strictly speaking wrong (not UTF-8) because it
  * (1) allows UTF-8 encoded UTF-16 surrogates
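
As a reading aid, the 3-byte conditions the patch adds can be restated as a plain function. This is a minimal sketch, not the code from utf8.h: the name `strict_utf8_3byte` is hypothetical, the 4-byte case is left out, and unlike the patched `IS_UTF8_CHAR_3c` it also checks the third byte of `ED` sequences. It rejects the surrogates (U+D800..U+DFFF) and the noncharacters U+FDD0..U+FDEF, U+FFFE and U+FFFF.

```c
#include <stdbool.h>

/* Hypothetical helper (not part of perl): strict check of one 3-byte
 * sequence. The caller must guarantee that p[0..2] are readable. */
static bool strict_utf8_3byte(const unsigned char *p)
{
    /* Generic 3-byte shape: lead byte E0..EF, two continuation bytes. */
    if (p[0] < 0xE0 || p[0] > 0xEF) return false;
    if ((p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80) return false;

    /* Overlong forms that would encode code points below U+0800. */
    if (p[0] == 0xE0 && p[1] < 0xA0) return false;

    /* Surrogates U+D800..U+DFFF are encoded as ED A0..BF xx. */
    if (p[0] == 0xED && p[1] > 0x9F) return false;

    if (p[0] == 0xEF) {
        /* Noncharacters U+FDD0..U+FDEF: EF B7 90..AF. */
        if (p[1] == 0xB7 && p[2] >= 0x90 && p[2] <= 0xAF) return false;
        /* Noncharacters U+FFFE and U+FFFF: EF BF BE and EF BF BF. */
        if (p[1] == 0xBF && p[2] >= 0xBE) return false;
    }
    return true;
}
```

With this sketch, `E2 82 AC` (U+20AC) and `ED 9F BF` (U+D7FF) pass, while `ED A0 80` (U+D800) and `EF BF BE` (U+FFFE) are rejected.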

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl v5.8.8:

Configured by jgmyers at Tue Feb 13 10:14:49 PST 2007.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.9-42.0.8.elsmp, archname=i686-linux-thread-multi
    uname='linux pong 2.6.9-42.0.8.elsmp #1 smp tue jan 30 12:33:47 est 2007 i686 i686 i386 gnulinux '
    config_args=''
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc-4.1', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='4.1.1', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc-4.1', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=/lib/libc-2.3.4.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.3.4'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    


@INC for perl v5.8.8:
    /u/jgmyers/perl/lib/5.8.8/i686-linux-thread-multi
    /u/jgmyers/perl/lib/5.8.8
    /u/jgmyers/perl/lib/site_perl/5.8.8/i686-linux-thread-multi
    /u/jgmyers/perl/lib/site_perl/5.8.8
    /u/jgmyers/perl/lib/site_perl/5.8.7/i686-linux-thread-multi
    /u/jgmyers/perl/lib/site_perl/5.8.7
    /u/jgmyers/perl/lib/site_perl/5.8.6/i686-linux-thread-multi
    /u/jgmyers/perl/lib/site_perl/5.8.6
    /u/jgmyers/perl/lib/site_perl/5.8.5/i686-linux-thread-multi
    /u/jgmyers/perl/lib/site_perl/5.8.5
    /u/jgmyers/perl/lib/site_perl/5.8.3/i686-linux-thread-multi
    /u/jgmyers/perl/lib/site_perl/5.8.3
    /u/jgmyers/perl/lib/site_perl
    .


Environment for perl v5.8.8:
    HOME=/u/jgmyers
    LANG=en_US
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/home/jgmyers/50/src/Linux2.6/pps/lib
    LOGDIR (unset)
    
    PATH=/tools/x/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/u/jgmyers/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash



p5pRT commented Jun 23, 2007

From james@mastros.biz

On Fri, Jun 22, 2007 at 02:31:52PM -0700, John Gardiner Myers wrote:

> This bug is similar to bug #38722. utf8::valid() and utf8::decode()
> incorrectly consider illegal characters and surrogates as being valid.
> A script that depends on using these functions to validate untrusted
> input will then have the resulting invalid unicode strings throw
> warnings out of Perl_uvuni_to_utf8_flags in later processing.

This sounds like a feature, not a bug. Perl uses utf8 to mean a
nonstrict superset of UTF-8. UTF-8 is the strict version. It sounds
like the bug may be in Perl_uvuni_to_utf8_flags being too picky.

BTW, Java-based applications may store characters above U+FFFF as utf8
with surrogates.

  Just my 1p,
  -=- James Mastros


p5pRT commented Jun 23, 2007

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented Mar 20, 2008

From jgmyers@proofpoint.com

> This sounds like a feature, not a bug. Perl uses utf8 to mean a
> nonstrict superset of UTF-8. UTF-8 is the strict version. It sounds
> like the bug may be in Perl_uvuni_to_utf8_flags being too picky.

As of Unicode 3.1, implementations are prohibited from interpreting
surrogates in UTF-8. See Unicode 5.0 conformance requirement C10 and
the definition of the UTF-8 encoding form, D92.

This is a security requirement: otherwise an attacker could get
prohibited content through a security syntax check that operates on
UTF-8 data.
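
One way to read the security argument, sketched in C: if a decoder interprets 3-byte encoded surrogates and pairs them up, a supplementary code point gains a second byte-level spelling, so a filter that only matches the standard 4-byte spelling would not see it. The helper names below are hypothetical, and the decoders deliberately perform no validation.

```c
#include <stdint.h>

/* Decode a 3-byte sequence with no range checks (illustration only). */
static uint32_t decode3_lax(const unsigned char *p)
{
    return ((uint32_t)(p[0] & 0x0F) << 12) |
           ((uint32_t)(p[1] & 0x3F) << 6)  |
            (uint32_t)(p[2] & 0x3F);
}

/* Decode a 4-byte sequence with no range checks (illustration only). */
static uint32_t decode4_lax(const unsigned char *p)
{
    return ((uint32_t)(p[0] & 0x07) << 18) |
           ((uint32_t)(p[1] & 0x3F) << 12) |
           ((uint32_t)(p[2] & 0x3F) << 6)  |
            (uint32_t)(p[3] & 0x3F);
}

/* Combine a UTF-16 high/low surrogate pair into one code point. */
static uint32_t combine_surrogates(uint32_t hi, uint32_t lo)
{
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}
```

For example, `F0 90 80 80` decodes to U+10000 with `decode4_lax`, and the six bytes `ED A0 80 ED B0 80` decode to the pair D800/DC00, which `combine_surrogates` also maps to U+10000. The two inputs share no byte-level prefix, which is the gap a strict validator closes by rejecting the second form.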


p5pRT commented Sep 18, 2013

From @jkeenan

On Fri Jun 22 14:31:51 2007, jgmyers wrote:

[snip]

Discussion in this RT petered out five years ago. Is there anyone
familiar with UTF-8 issues who could review the discussion and recommend
an action?

Thank you very much.
Jim Keenan


p5pRT commented Sep 19, 2013

From @cpansprout

On Wed Sep 18 16:32:08 2013, jkeenan wrote:

[snip]

> Discussion in this RT petered out five years ago. Is there anyone
> familiar with UTF-8 issues who could review the discussion and recommend
> an action?

Yes. It’s a bit confusing. Perl strings can contain characters that
are not valid Unicode characters; that is not a bug, because Perl
strings are used for data other than Unicode. The utf8::valid function
appears to have been added for the sake of XS modules (and the core
itself) so they can check in their tests that the scalars they are
producing contain valid byte sequences when the SvUTF8 flag is on. I.e.,
the check is: Is this a valid scalar? Does it contain a logical
sequence of characters (as opposed to being mangled internally)?

The OP here seems to have misunderstood the function’s purpose, and
would like it to validate strict UTF-8, which is something that Encode
already provides, and which utf8::valid does not do and was not intended
for.

So I think we can reject it as not-a-bug. But perhaps the documentation
in utf8.pm could use some clarification. (It seems perfectly clear to
me, but the fact that someone else misunderstood it suggests that that
may be just me.)

--

Father Chrysostomos


p5pRT commented Sep 19, 2013

From @ikegami

On Wed, Sep 18, 2013 at 9:05 PM, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:

[snip]

The ticket also mentioned utf8::decode. The docs for utf8::decode claim it
decodes UTF-X. It actually decodes utf8. (e.g. utf8::encode and
utf8::decode will happily roundtrip 0x20_000.) That's probably a good thing.


p5pRT commented Oct 20, 2013

From @jkeenan

On Wed Sep 18 18:05:55 2013, sprout wrote:

[snip]

The documentation for utf8::valid has, in fact, been updated a couple of
times since the OP filed the bug back in 2007. It currently reads:

######
[INTERNAL] Test whether I<$string> is in a consistent state regarding
UTF-8. Will return true if it is well-formed UTF-8 and has the UTF-8
flag on B<or> if I<$string> is held as bytes (both these states are
'consistent'). Main reason for this routine is to allow Perl's test
suite to check that operations have left strings in a consistent state.
You most probably want to use utf8::is_utf8() instead.
#####
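
The "consistent state" rule in that paragraph can be modelled in a few lines of C. This is a toy model under stated assumptions, not Perl's implementation: `struct toy_scalar`, `toy_well_formed` and `toy_valid` are hypothetical names, the flag merely stands in for SvUTF8, and the structural check handles sequences of up to four bytes only. It deliberately does not reject surrogates or noncharacters, which is the behaviour discussed in this ticket.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a Perl scalar: a byte buffer plus a flag
 * playing the role of SvUTF8. Not Perl's real data structure. */
struct toy_scalar {
    const unsigned char *bytes;
    size_t len;
    bool utf8_flag;
};

/* Structural check only: lead and continuation bytes must line up and
 * the shortest form must be used. Surrogates and noncharacters pass. */
static bool toy_well_formed(const unsigned char *p, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = p[i];
        size_t need;                       /* continuation bytes expected */
        if (c < 0x80) need = 0;
        else if (c >= 0xC2 && c <= 0xDF) need = 1;
        else if (c >= 0xE0 && c <= 0xEF) need = 2;
        else if (c >= 0xF0 && c <= 0xF7) need = 3;
        else return false;                 /* stray continuation or bad lead */
        if (len - i - 1 < need) return false;          /* truncated sequence */
        for (size_t k = 1; k <= need; k++)
            if ((p[i + k] & 0xC0) != 0x80) return false;
        if (c == 0xE0 && p[i + 1] < 0xA0) return false;   /* overlong 3-byte */
        if (c == 0xF0 && p[i + 1] < 0x90) return false;   /* overlong 4-byte */
        i += need + 1;
    }
    return true;
}

/* Models the documented rule: a byte string is always consistent; a
 * flagged string is consistent only if its bytes are well-formed. */
static bool toy_valid(const struct toy_scalar *sv)
{
    if (!sv->utf8_flag) return true;
    return toy_well_formed(sv->bytes, sv->len);
}
```

In this model a flagged scalar holding `ED A0 80` (an encoded surrogate) counts as consistent, a flagged scalar holding a lone `FF` does not, and the same `FF` with the flag off does, since it is simply a byte string.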

I think this paragraph is satisfactory as is because (a) it suggests
that utf8::valid is mainly for Perl 5's internal use and (b) it suggests
a different function which is probably what the average user should use.

Since the people who have most recently modified this paragraph are khw
and ikegami, each of whom is knowledgeable in this area, any further
improvements are likely to be touchups which won't warrant keeping this
RT open. Accordingly, I am taking this ticket for the purpose of
closing it within 7 days.

Thank you very much.
Jim Keenan


p5pRT commented Oct 24, 2013

From @jkeenan

On Sat Oct 19 19:08:12 2013, jkeenan wrote:

[snip]

No objection heard. Closing ticket and turning my attention to the Toronto.pm meeting, now in progress!


p5pRT commented Oct 24, 2013

@jkeenan - Status changed from 'open' to 'resolved'
