Matching upper ASCII characters from file in RE patterns #10867
From pool@utilika.org
The attached script unibug.pl, which reads from the attached file unibug.txt, demonstrates a problem in Perl 5.10.0 which Karl Williamson says is still present in 5.13.7. It matches the input line against 7 regular-expression patterns, 1-7. Patterns 3 and 7 should fail to match; the others should match. However:
With "use utf8", pattern 3 matches instead of failing.
With "use encoding 'utf8'" (or with both pragmas), pattern 3 matches instead of failing, and patterns 4, 5, and 6 fail instead of matching.
Karl Williamson has provided two additional files for demonstrating this problem: nobreak_utf8.pl and nobreak_latin1.pl.
From @ikegami
On Tue, Nov 30, 2010 at 4:57 PM, Jonathan Pool <perlbug-followup@perl.org> wrote:
Bad test

print ('3. The NBS is ' . (/[\7f-\x80]/ ? '' : 'NOT ') . 'matched by

should be

print ('3. The NBS is ' . (/[\x7f-\x80]/ ? '' : 'NOT ') . 'matched by

I'm not sure if that's a bug, or if it's broken by design.
- Eric
The RT System itself - Status changed from 'new' to 'open'
From BQW10602@nifty.com
On Tue, 30 Nov 2010 21:26:20 -0500
This seems explainable and is perhaps not a bug.
According to the POD of encoding.pm,
Then, under use encoding "utf8", U+00A0 in Unicode should be "\xC2\xA0".
The reason why /[\x7F-\x80]/ matches U+00A0 is that /[\x7F-\x{FFFD}]/
Regards,
From @khwilliamson
SADAHIRO Tomoyuki wrote:
I understand the rest of this post, but I don't understand the
I looked at some debug info, and see that you are correct. Jonathan,
From @ikegami
On Thu, Dec 9, 2010 at 12:58 AM, karl williamson <public@khwilliamson.com> wrote:
He was quoting from the docs. His post was of the form "The docs say
From @ikegami
On Thu, Dec 9, 2010 at 1:35 AM, Jonathan Pool <pool@utilika.org> wrote:
The file is being read in without issue. The problem is with the literals in
It then operates on the scalar.
You explicitly stated you wanted different behaviour from the literal by

perl -e'use encoding "utf8"; qr/[\x7F-\x80]/'

means

perl -e'qr/{{{decode("utf8", "[\x7F-\x80]")}}}/'

which becomes

perl -e'qr/[\x7F-\x{FFFD}]/'

The effect of "use encoding" on \x escapes in literals and the like is why
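The rewriting described above can be reproduced without the pragma. The following is a minimal sketch, assuming the pragma's effect on the literal is equivalent to calling Encode's decode on the raw pattern bytes, as the explanation states; the variable names are illustrative and not taken from unibug.pl.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The pattern source as raw bytes: '[', 0x7F, '-', 0x80, ']'.
my $bytes = "[\x7F-\x80]";

# Decode those bytes as UTF-8. A lone 0x80 is a malformed sequence, so with
# the default CHECK setting Encode substitutes U+FFFD for it.
my $decoded = decode("UTF-8", $bytes);

# The fourth character of the decoded string is the upper bound of the class.
my $class_end = ord(substr($decoded, 3, 1));
printf "upper bound after decoding: U+%04X\n", $class_end;

# U+00A0 (no-break space) lies inside the widened range but outside the
# range that was originally written.
my $nbsp     = "\x{A0}";
my $widened  = ($nbsp =~ /[\x7F-\x{FFFD}]/) ? 1 : 0;
my $original = ($nbsp =~ /[\x7F-\x80]/)     ? 1 : 0;
print "matches widened class: $widened\n";
print "matches original class: $original\n";
```

This is consistent with pattern 3 in the report matching the no-break space once the class is decoded as UTF-8.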
From BQW10602@nifty.com
On Wed, 08 Dec 2010 22:58:20 -0700
Only 8859-7 and euc-jp are mentioned in the doc.
We wish the POD would include not only concrete examples
Regards,
From @ikegami
On Thu, Dec 9, 2010 at 11:58 AM, Jonathan Pool <pool@utilika.org> wrote:
C2 80 is the UTF-8 encoding of U+0080, so the following are equivalent:

$x = "\x80";

and

use encoding 'UTF-8';

(Except perhaps in how the UTF8 flag is set, but that's not supposed to make
- Eric
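The relationship between U+0080 and the bytes C2 80 can be checked with Encode directly. This is a small sketch that does not load the encoding pragma; it only shows the encode and decode steps the equivalence relies on, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char  = "\x{80}";                # one character, code point U+0080
my $bytes = encode("UTF-8", $char);  # its UTF-8 encoding, two bytes

# Render each byte as two hex digits.
my $hex = join " ", map { sprintf "%02X", ord } split //, $bytes;
print "UTF-8 bytes of U+0080: $hex\n";

# Decoding the two bytes gives back the single character U+0080.
my $roundtrip = decode("UTF-8", "\xC2\x80");
my $same = ($roundtrip eq $char) ? 1 : 0;
print "round trip equal: $same\n";
```

The string comparison looks at characters, not at the internal UTF8 flag, which is the caveat mentioned in the parenthetical above.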
From @ikegami
Closing: Working as designed.
@ikegami - Status changed from 'open' to 'rejected'
From @ikegami
On Thu, Dec 9, 2010 at 5:04 PM, Jonathan Pool <pool@utilika.org> wrote:
This does not relate to Perl development. Please take it elsewhere, such
From pool@utilika.org
The script reads a line from a UTF8-encoded file into a Perl scalar. It then operates on the scalar.

In man perlunicode, one reads:

"Unless explicitly stated, Perl operators use character semantics for Unicode data and byte semantics for non-Unicode data. ... Under character semantics, many operations that formerly operated on bytes now operate on characters. A character in Perl is logically just a number ranging from 0 to 2**31 or so. The Unicode code for the desired character, in hexadecimal, should be placed in the braces. For instance, a smiley face is "\x{263A}". This encoding scheme only works for all characters ...."

This documentation tells me that the way to refer to a Unicode character (once it is in a string that has been assigned to a Perl scalar) is by its Unicode codepoint, not by its UTF-8 encoding. A white smiling face has codepoint U+263a, but it has UTF-8 encoding e298ba. The documentation tells me to refer to that character with \x{263a}, not with \x{e298ba}.

As you say, \x80 is not a legal UTF-8 encoding, but it is a legal (even though unnamed) Unicode character codepoint. So on the basis of the documentation I would expect Perl to recognize it as such and not to convert \x80 to \x{fffd}.

Illustration:

perl -wE 'binmode STDOUT, ":utf8"; use utf8; say "\x{263a}";'

If I'm mistaken about any of the above, I'll be grateful to be corrected. Thanks for your help.
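The distinction drawn here between a code point and its UTF-8 encoding can be made concrete with a short sketch. It assumes no encoding pragma is in effect and uses Encode only to produce the bytes; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $smiley = "\x{263A}";                # WHITE SMILING FACE, addressed by code point
my $octets = encode("UTF-8", $smiley);  # the bytes that would be stored in a UTF-8 file

my $char_len  = length $smiley;         # counted in characters
my $octet_len = length $octets;         # counted in bytes
my $octet_hex = unpack "H*", $octets;   # hex dump of the encoded bytes

print "characters: $char_len\n";
print "bytes: $octet_len ($octet_hex)\n";

# In a pattern, the character is matched by its code point.
my $matches_codepoint = ($smiley =~ /\x{263A}/) ? 1 : 0;
print "matched by \\x{263A}: $matches_codepoint\n";
```

The same character is one unit when addressed by code point and three units when viewed as UTF-8 bytes, which is why the two notations are not interchangeable inside a pattern.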
From pool@utilika.org
Thank you for this explanation. So, is it possible for the source code (in a UTF-8 file) to use \x80 (or any numeric \x escape) to represent the character U+0080?
From pool@utilika.org
Could the latter representation (\xc2\x80) appear in a regular-expression character class, too?
From BQW10602@nifty.com
On Thu, 9 Dec 2010 14:04:04 -0800
Could with perl 5.8.0, 5.8.1, 5.8.3, 5.8.8.
#!perl
perl 5.008
perl 5.008001
perl 5.008003
perl 5.008008
perl 5.008009
perl 5.010000
perl 5.010001
From BQW10602@nifty.com
On Sat, 11 Dec 2010 11:47:55 +0900
Sadly, it has been broken in a character class.
In an older perl (to 5.8.8),
In a newer perl (from 5.8.9),
#!perl
my $u00e1 = "\N{LATIN SMALL LETTER A WITH ACUTE}"; # U+00E1
print "string-eq: ";
perl 5.008
perl 5.008001
perl 5.008003
perl 5.008008
perl 5.008009
perl 5.010000
perl 5.010001
From @demerphq
On 9 December 2010 08:16, Eric Brine <ikegami@adaelis.com> wrote:
Yes, for many including me, it seems rather insane, I guess for some
Also, and much worse is that at least up until 5.10 this insane
Fixed sometime since then as it's not in blead, but I haven't checked

$ perl -v && perl -le'use encoding "iso 8859-7"; $a = "\xDF";
This is perl, v5.10.1 (*) built for i486-linux-gnu-thread-multi
Copyright 1987-2009, Larry Wall
Perl may be copied only under the terms of either the Artistic License or the
Complete documentation for Perl, including FAQ lists, should be found on
0x03af

$ ./perl -v && ./perl -Ilib -le'use encoding "iso 8859-7"; $a =
This is perl 5, version 13, subversion 7 (v5.13.7-265-gb1811a1*) built
Copyright 1987-2010, Larry Wall
Perl may be copied only under the terms of either the Artistic License or the
Complete documentation for Perl, including FAQ lists, should be found on
0x03af
From BQW10602@nifty.com
On Sat, 11 Dec 2010 14:43:16 +0100
Dan Kogai, the maintainer of Encode including encoding.pm, said in his blog
- The encoding pragma has been originally developed for the purpose
at 19:15, 22 June 2007
- "Use encoding;" should be used only when you need to rewrite
at 14:30, 08 June 2009
If he had intended encoding.pm mainly for the purpose to make
Regards,
From BQW10602@nifty.com
On Sat, 11 Dec 2010 14:43:16 +0100
This fix seems to have happened between 5.11.4 and 5.11.5 by
which makes \N{U+XX} syntax always have Unicode semantics
Perhaps the Encode maintainer also wouldn't consider
Migrated from rt.perl.org#80030 (status was 'rejected')
Searchable as RT80030$