Matching upper ASCII characters from file in RE patterns #10867

Closed
p5pRT opened this issue Nov 30, 2010 · 23 comments

p5pRT commented Nov 30, 2010

Migrated from rt.perl.org#80030 (status was 'rejected')

Searchable as RT80030$

p5pRT commented Nov 30, 2010

From pool@utilika.org

The attached script unibug.pl, which reads from the attached file unibug.txt, demonstrates a problem in Perl 5.10.0 which Karl Williamson says is still present in 5.13.7.

It matches the input line against 7 regular-expression patterns, 1-7. Patterns 3 and 7 should fail to match; the others should match.

However:

With "use utf8", pattern 3 matches instead of failing.

With "use encoding 'utf8'" (or with both pragmas), pattern 3 matches instead of failing, and patterns 4, 5, and 6 fail instead of matching.

Karl Williamson has provided two additional files for demonstrating this problem: nobreak_utf8.pl and nobreak_latin1.pl.

p5pRT commented Nov 30, 2010

From pool@utilika.org

Archive.zip

p5pRT commented Nov 30, 2010

From pool@utilika.org

ˉ

p5pRT commented Dec 1, 2010

From @ikegami

On Tue, Nov 30, 2010 at 4:57 PM, Jonathan Pool <perlbug-followup@perl.org> wrote:

# New Ticket Created by Jonathan Pool
# Please include the string: [perl #80030]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=80030 >

The attached script unibug.pl, which reads from the attached file
unibug.txt, demonstrates a problem in Perl 5.10.0 which Karl Williamson says
is still present in 5.13.7.

It matches the input line against 7 regular-expression patterns, 1-7.
Patterns 3 and 7 should fail to match; the others should match.

However​:

With "use utf8", pattern 3 matches instead of failing.

Bad test: the pattern is missing the "x" in \x7f.

  print ('3. The NBS is ' . (/[\7f-\x80]/ ? '' : 'NOT ') . 'matched by
/[\7f-\x80]/' . "\n");

should be

  print ('3. The NBS is ' . (/[\x7f-\x80]/ ? '' : 'NOT ') . 'matched by
/[\7f-\x80]/' . "\n");

With "use encoding 'utf8'" (or with both pragmas), [...] patterns 4, 5, and
6 fail instead of matching.

I'm not sure if that's a bug, or if it's broken by design.

- Eric

p5pRT commented Dec 1, 2010

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Dec 4, 2010

From BQW10602@nifty.com

On Tue, 30 Nov 2010 21​:26​:20 -0500
Eric Brine <ikegami@​adaelis.com> wrote​:

On Tue, Nov 30, 2010 at 4​:57 PM, Jonathan Pool <perlbug-followup@​perl.org>wrote​:

print ('3. The NBS is ' . (/[\x7f-\x80]/ ? '' : 'NOT ') . 'matched by
/[\7f-\x80]/' . "\n");

With "use encoding 'utf8'" (or with both pragmas), [...] patterns 4, 5, and
6 fail instead of matching.

I'm not sure if that's a bug, or if it's broken by design.

- Eric

This seems explainable and is perhaps not a bug.

According to the POD of encoding.pm
(see http://search.cpan.org/~dankogai/Encode-2.40/encoding.pm ),
"\xDF" under use encoding "iso 8859-7" is \x{3af} in Unicode
"\xA4\xA1" under use encoding "euc-jp" is \x{3041} in Unicode

Then, under use encoding "utf8", U+00A0 in Unicode should be "\xC2\xA0".
Using "\xA0" and expecting U+00A0 is wrong.

/[\x7F-\x80]/ matches U+00A0 because \x80 is malformed as UTF-8 and is
replaced with \x{FFFD}, giving /[\x7F-\x{FFFD}]/, which matches U+00A0.

Regards,
SADAHIRO Tomoyuki
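
The substitution described above can be reproduced with Encode directly, without the pragma. This is a minimal sketch, assuming that "use encoding 'utf8'" treats the bytes produced by \xHH escapes roughly the way Encode::decode treats a byte string; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A lone continuation byte is malformed UTF-8; with the default CHECK
# value, decode() substitutes U+FFFD for it.
my $lone_a0 = decode('UTF-8', "\xA0");

# The two-byte sequence C2 A0 is the well-formed encoding of U+00A0.
my $nbsp = decode('UTF-8', "\xC2\xA0");

# With FB_CROAK the malformed input is reported instead of replaced.
# decode() may modify its input when CHECK is set, so pass a copy.
my $copy = "\xA0";
my $strict_ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;

printf "lone A0 decodes to U+%04X\n", ord $lone_a0;
printf "C2 A0 decodes to   U+%04X\n", ord $nbsp;
print  "strict decode of lone A0 succeeded: $strict_ok\n";
```

The strict variant is one way to get an error instead of a silent U+FFFD when a byte string is not valid UTF-8.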

p5pRT commented Dec 9, 2010

From @khwilliamson

SADAHIRO Tomoyuki wrote​:

On Tue, 30 Nov 2010 21​:26​:20 -0500
Eric Brine <ikegami@​adaelis.com> wrote​:

On Tue, Nov 30, 2010 at 4​:57 PM, Jonathan Pool <perlbug-followup@​perl.org>wrote​:

print ('3. The NBS is ' . (/[\x7f-\x80]/ ? '' : 'NOT ') . 'matched by
/[\7f-\x80]/' . "\n");

With "use encoding 'utf8'" (or with both pragmas), [...] patterns 4, 5, and
6 fail instead of matching.

I'm not sure if that's a bug, or if it's broken by design.

- Eric

This seems explainable and is perhaps not a bug.

According to POD of encoding.pm,
(see http​://search.cpan.org/~dankogai/Encode-2.40/encoding.pm )
"\xDF" under use encoding "iso 8859-7" is \x{3af} in Unicode
"\xA4\xA1" under use encoding "euc-jp" is \x{3041} in Unicode

I understand the rest of this post, but I don't understand the
relevance of 8859-7 and euc-jp to the discussion. Please enlighten me.

Then, under use encoding "utf8", U+00A0 in Unicode should be "\xC2\xA0".
Use of "\xA0" expecting U+00A0 is wrong.

/[\x7F-\x80]/ matches U+00A0 because \x80 is malformed as UTF-8 and is
replaced with \x{FFFD}, giving /[\x7F-\x{FFFD}]/, which matches U+00A0.

I looked at some debug info, and see that you are correct. Jonathan,
you said that the encoding was utf8, but \x80 is not a legal
utf8-encoded character. But it should have warned that it was
substituting FFFD.

Regards,
SADAHIRO Tomoyuki

p5pRT commented Dec 9, 2010

From @ikegami

On Thu, Dec 9, 2010 at 12​:58 AM, karl williamson <public@​khwilliamson.com>wrote​:

SADAHIRO Tomoyuki wrote​:

According to POD of encoding.pm, (see
http://search.cpan.org/~dankogai/Encode-2.40/encoding.pm )
"\xDF" under use encoding "iso 8859-7" is \x{3af} in Unicode
"\xA4\xA1" under use encoding "euc-jp" is \x{3041} in Unicode

I understand the rest of this post, but I don't understand the relevance
of 8859-7 and euc-jp to the discussion. Please enlighten me.

He was quoting from the docs. His post was of the form "The docs say
encoding does X for 'iso 8859-7', so encoding does Y for 'UTF-8'".

p5pRT commented Dec 9, 2010

From @ikegami

On Thu, Dec 9, 2010 at 1​:35 AM, Jonathan Pool <pool@​utilika.org> wrote​:

Jonathan, you said that the encoding was utf8, but \x80 is not a legal
utf8-encoded character. But it should have warned that it was substituting
FFFD.

The script reads a line from a UTF8-encoded file into a Perl scalar.

The file is being read in without issue. The problem is with the literals in
the source file.

It then operates on the scalar.

In man perlunicode, one reads​: "Unless explicitly stated, Perl operators
use [...]

You explicitly stated you wanted different behaviour from the literal by
using "use encoding".

perl -e'use encoding "utf8"; qr/[\x7F-\x80]/'

means

perl -e'qr/{{{decode("utf8", "[\x7F-\x80]")}}}/'

which becomes

perl -e'qr/[\x7F-\x{FFFD}]/'

The effect of "use encoding" on \x escapes in literals and the like is why
some people avoid "use encoding".
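
The rewriting shown above can be approximated without the pragma. This is only a sketch: it assumes the pragma's effect on the literal is equivalent to an explicit Encode::decode call, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode the pattern text as UTF-8 bytes. The lone \x80 is malformed,
# so the result is the string "[\x7F-\x{FFFD}]".
my $class = decode('UTF-8', "[\x7F-\x80]");
my $nbsp  = "\x{A0}";    # U+00A0 NO-BREAK SPACE

# The decoded class spans U+007F..U+FFFD and therefore contains U+00A0;
# the class as written spans only U+007F..U+0080.
my $decoded_matches = $nbsp =~ /$class/      ? 1 : 0;
my $literal_matches = $nbsp =~ /[\x7F-\x80]/ ? 1 : 0;

printf "upper bound of decoded class: U+%04X\n", ord substr($class, 3, 1);
print  "decoded class matches NBSP: $decoded_matches\n";
print  "literal class matches NBSP: $literal_matches\n";
```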

p5pRT commented Dec 9, 2010

From BQW10602@nifty.com

On Wed, 08 Dec 2010 22​:58​:20 -0700
karl williamson <public@​khwilliamson.com> wrote​:

SADAHIRO Tomoyuki wrote​:

This seems explainable and is perhaps not a bug.

According to POD of encoding.pm,
(see http​://search.cpan.org/~dankogai/Encode-2.40/encoding.pm )
"\xDF" under use encoding "iso 8859-7" is \x{3af} in Unicode
"\xA4\xA1" under use encoding "euc-jp" is \x{3041} in Unicode

I understand the rest of this post, but I don't understand the
relevance of 8859-7 and euc-jp to the discussion. Please enlighten me.

Only 8859-7 and euc-jp are mentioned in the doc.

We wish the POD included not only concrete examples
but also a description of how the pragma works.
However, the perl core has too many uses of the PL_encoding macro
to document the behavior precisely.
Each use of PL_encoding marks a place where Perl code behaves
differently with encoding.pm than without it.

Regards,
SADAHIRO Tomoyuki

p5pRT commented Dec 9, 2010

From @ikegami

On Thu, Dec 9, 2010 at 11​:58 AM, Jonathan Pool <pool@​utilika.org> wrote​:

The file is being read in without issue. The problem is with the literals
in the source file.

You explicitly stated you wanted different behaviour from the literal by
using "use encoding".

perl -e'use encoding "utf8"; qr/[\x7F-\x80]/'

means

perl -e'qr/{{{decode("utf8", "[\x7F-\x80]")}}}/'

which becomes

perl -e'qr/[\x7F-\x{FFFD}]/'

The effect of "use encoding" on \x escapes in literals and the like is
why some people avoid "use encoding".

Thank you for this explanation.

So, is it possible for the source code (in a UTF-8 file) to use \x80 (or
any numeric \x escape) to represent the character U+0080?

C2 80 is the UTF-8 encoding of U+0080, so the following are equivalent​:

$x = "\x80";

and

use encoding 'UTF-8';
$x = "\xC2\x80";

(Except perhaps in how the UTF8 flag is set, but that's not supposed to make
a difference.)

- Eric
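
The byte-level claim here (C2 80 is the UTF-8 encoding of U+0080) can be checked with Encode alone. This is a minimal sketch that does not use the encoding pragma at all; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char  = "\x{80}";                        # the character U+0080
my $bytes = encode('UTF-8', $char);          # its UTF-8 encoding
my $hex   = uc unpack('H*', $bytes);         # hex dump of those bytes
my $back  = decode('UTF-8', "\xC2\x80");     # decode the two bytes again

printf "U+0080 encodes to bytes %s\n", $hex;
printf "C2 80 decodes to U+%04X\n", ord $back;
```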

p5pRT commented Dec 9, 2010

From @ikegami

Closing​: Working as designed.

p5pRT commented Dec 9, 2010

From [Unknown Contact. See original ticket]

Closing​: Working as designed.

p5pRT commented Dec 9, 2010

@ikegami - Status changed from 'open' to 'rejected'

@p5pRT p5pRT closed this as completed Dec 9, 2010

p5pRT commented Dec 9, 2010

From @ikegami

On Thu, Dec 9, 2010 at 5​:04 PM, Jonathan Pool <pool@​utilika.org> wrote​:

Could the latter representation (\xc2\x80) appear in a regular-expression
character class, too?

This does not relate to Perl development. Please take it elsewhere, such
as http://www.perlmonks.org/

p5pRT commented Dec 11, 2010

From pool@utilika.org

Jonathan, you said that the encoding was utf8, but \x80 is not a legal utf8-encoded character. But it should have warned that it was substituting FFFD.

The script reads a line from a UTF8-encoded file into a Perl scalar.

It then operates on the scalar.

In man perlunicode, one reads​:

"Unless explicitly stated, Perl operators use character semantics for Unicode data and byte semantics for non-Unicode data. ... Under character semantics, many operations that formerly operated on bytes now operate on characters. A character in Perl is logically just a number ranging from 0 to 2**31 or so. The Unicode code for the desired character, in hexadecimal, should be placed in the braces. For instance, a smiley face is "\x{263A}". This encoding scheme only works for all characters ...."

This documentation tells me that the way to refer to a Unicode character (once it is in a string that has been assigned to a Perl scalar) is by its Unicode codepoint, not by its UTF-8 encoding. A white smiling face has codepoint U+263a, but it has UTF-8 encoding e298ba. The documentation tells me to refer to that character with \x{263a}, not with \x{e298ba}.

As you say, \x80 is not a legal UTF-8 encoding, but it is a legal (even though unnamed) Unicode character codepoint. So on the basis of the documentation I would expect Perl to recognize it as such and not to convert \x80 to \x{fffd}.

Illustration​:

perl -wE 'binmode STDOUT, ":utf8"; use utf8; say "\x{263a}";'

perl -wE 'binmode STDOUT, ":utf8"; use utf8; say "\x{e298ba}";'
?????
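
The difference between a code point and its UTF-8 bytes can also be inspected with Encode. This is a small sketch with illustrative variable names, not part of the attached scripts.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $smiley = "\x{263A}";                 # one character, code point U+263A
my $bytes  = encode('UTF-8', $smiley);   # its UTF-8 encoding as a byte string
my $hex    = unpack('H*', $bytes);       # hex dump of the encoded bytes

# \x{e298ba} names the single code point 0xE298BA, which lies above the
# Unicode range, rather than the three bytes E2 98 BA.
my $big = ord "\x{e298ba}";

printf "characters: %d, bytes: %d, hex: %s\n",
    length($smiley), length($bytes), $hex;
printf "\\x{e298ba} is code point 0x%X\n", $big;
```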

If I'm mistaken about any of the above, I'll be grateful to be corrected.

Thanks for your help.

p5pRT commented Dec 11, 2010

From pool@utilika.org

The file is being read in without issue. The problem is with the literals in the source file.

You explicitly stated you wanted different behaviour from the literal by using "use encoding".

perl -e'use encoding "utf8"; qr/[\x7F-\x80]/'

means

perl -e'qr/{{{decode("utf8", "[\x7F-\x80]")}}}/'

which becomes

perl -e'qr/[\x7F-\x{FFFD}]/'

The effect of "use encoding" on \x escapes in literals and the like is why some people avoid "use encoding".

Thank you for this explanation.

So, is it possible for the source code (in a UTF-8 file) to use \x80 (or any numeric \x escape) to represent the character U+0080?

p5pRT commented Dec 11, 2010

From pool@utilika.org

So, is it possible for the source code (in a UTF-8 file) to use \x80 (or any numeric \x escape) to represent the character U+0080?

C2 80 is the UTF-8 encoding of U+0080, so the following are equivalent​:

$x = "\x80";

and

use encoding 'UTF-8';
$x = "\xC2\x80";

(Except perhaps in how the UTF8 flag is set, but that's not supposed to make a difference.)

- Eric

Could the latter representation (\xc2\x80) appear in a regular-expression character class, too?

p5pRT commented Dec 11, 2010

From BQW10602@nifty.com

On Thu, 9 Dec 2010 14​:04​:04 -0800
Jonathan Pool <pool@​utilika.org> wrote​:

use encoding 'UTF-8';
$x = "\xC2\x80";

(Except perhaps in how the UTF8 flag is set, but that's not supposed to make a difference.)

- Eric

Could the latter representation (\xc2\x80) appear in a regular-expression character class, too?

Could with perl 5.8.0, 5.8.1, 5.8.3, 5.8.8.
Cannot with perl 5.8.9, 5.10.0, 5.10.1.
(I didn't run with other versions.)

#!perl
use strict;
use warnings;
use charnames ':full';
use encoding 'UTF-8';
print "perl $]\n";
print "a\N{NO-BREAK SPACE}z" =~ /a\xC2\xA0z/ ? "ok\n" : "not ok\n";
__END__

perl 5.008
ok

perl 5.008001
ok

perl 5.008003
ok

perl 5.008008
ok

perl 5.008009
not ok

perl 5.010000
not ok

perl 5.010001
not ok

p5pRT commented Dec 11, 2010

From BQW10602@nifty.com

On Sat, 11 Dec 2010 11​:47​:55 +0900
SADAHIRO Tomoyuki <bqw10602@​nifty.com> wrote​:

Could the latter representation (\xc2\x80) appear in a regular-expression character class, too?

Could with perl 5.8.0, 5.8.1, 5.8.3, 5.8.8.
Cannot with perl 5.8.9, 5.10.0, 5.10.1.
(I didn't run with other versions.)

Sadly, it has always been broken in a character class.
[\xHH\xHH] was never interpreted as a multi-octet character "\xHH\xHH"
under use encoding "a-multi-octet-encoding".

In older perls (up to 5.8.8),
  /\xC2\xA0/ matched only U+00A0. (by design)
  /\xE1\x80\x80/ matched only U+1000. (by design)
  /[\xC2\xA0]/ matched U+00C2 and U+00A0. (broken)
  /[\xE1\x80\x80]/ matched U+00E1 and U+0080. (broken)

In newer perls (from 5.8.9),
  /\xC2\xA0/ matches only "\x{FFFD}"x2. (broken)
  /\xE1\x80\x80/ matches only "\x{FFFD}"x3. (broken)
  /[\xE1\x80\x80]/ and /[\xC2\xA0]/ match only U+FFFD. (broken)

#!perl
use strict;
use warnings;
use charnames ':full';
use encoding 'UTF-8';
print "perl $]\n";

my $u00e1 = "\N{LATIN SMALL LETTER A WITH ACUTE}"; # U+00E1

print "string-eq: ";
print "a\x{1000}z" eq "a\xE1\x80\x80z" ? "ok\n" : "not ok\n";
print "reg-exact: ";
print "a\x{1000}z" =~ /a\xE1\x80\x80z/ ? "ok\n" : "not ok\n";
print "reg-class: ";
print "a\x{1000}z" =~ /a[\xE1\x80\x80]z/ ? "ok\n" : "not ok\n";
print " vs 00E1: ";
print "a${u00e1}z" !~ /a[\xE1\x80\x80]z/ ? "ok\n" : "not ok\n";
print " vs FFFD: ";
print "a\x{FFFD}z" !~ /a[\xE1\x80\x80]z/ ? "ok\n" : "not ok\n";
__END__

perl 5.008
string-eq: ok
reg-exact: ok
reg-class: not ok
  vs 00E1: not ok
  vs FFFD: ok

perl 5.008001
string-eq: ok
reg-exact: ok
reg-class: not ok
  vs 00E1: not ok
  vs FFFD: ok

perl 5.008003
string-eq: ok
reg-exact: ok
reg-class: not ok
  vs 00E1: not ok
  vs FFFD: ok

perl 5.008008
string-eq: ok
reg-exact: ok
reg-class: not ok
  vs 00E1: not ok
  vs FFFD: ok

perl 5.008009
string-eq: ok
reg-exact: not ok
reg-class: not ok
  vs 00E1: ok
  vs FFFD: not ok

perl 5.010000
string-eq: ok
reg-exact: not ok
reg-class: not ok
  vs 00E1: ok
  vs FFFD: not ok

perl 5.010001
string-eq: ok
reg-exact: not ok
reg-class: not ok
  vs 00E1: ok
  vs FFFD: not ok

p5pRT commented Dec 11, 2010

From @demerphq

On 9 December 2010 08​:16, Eric Brine <ikegami@​adaelis.com> wrote​:

On Thu, Dec 9, 2010 at 1​:35 AM, Jonathan Pool <pool@​utilika.org> wrote​:

Jonathan, you said that the encoding was utf8, but \x80 is not a legal
utf8-encoded character.  But it should have warned that it was substituting
FFFD.

The script reads a line from a UTF8-encoded file into a Perl scalar.

The file is being read in without issue. The problem is with the literals in
the source file.

It then operates on the scalar.

In man perlunicode, one reads​: "Unless explicitly stated, Perl operators
use [...]

You explicitly stated you wanted different behaviour from the literal by
using "use encoding".

perl -e'use encoding "utf8"; qr/[\x7F-\x80]/'

means

perl -e'qr/{{{decode("utf8", "[\x7F-\x80]")}}}/'

which becomes

perl -e'qr/[\x7F-\x{FFFD}]/'

The effect of "use encoding" on \x escapes in literals and the like is why
some people avoid "use encoding".

Yes, for many including me, it seems rather insane, I guess for some
it makes sense, but I really wish they had picked a different escape
to use than remapping \x{}.

Also, and much worse, at least up until 5.10 this insane
remapping of codepoints also affected the \N{U+$codepoint} syntax.

It was fixed sometime since then, as it's not in blead, but I haven't
checked when, or whether it is fixed in 5.12.

$ perl -v && perl -le'use encoding "iso 8859-7"; $a = "\xDF";
$b="\N{U+DF}"; printf "0x%04x\n", ord for $a,$b'

This is perl, v5.10.1 (*) built for i486-linux-gnu-thread-multi

Copyright 1987-2009, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http​://www.perl.org/, the Perl Home Page.

0x03af
0x03af

$ ./perl -v && ./perl -Ilib -le'use encoding "iso 8859-7"; $a =
"\xDF"; $b="\N{U+DF}"; printf "0x%04x\n", ord for $a,$b'

This is perl 5, version 13, subversion 7 (v5.13.7-265-gb1811a1*) built
for i686-linux
(with 1 registered patch, see perl -V for more detail)

Copyright 1987-2010, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http​://www.perl.org/, the Perl Home Page.

0x03af
0x00df

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented Dec 12, 2010

From BQW10602@nifty.com

On Sat, 11 Dec 2010 14​:43​:16 +0100
demerphq <demerphq@​gmail.com> wrote​:

On 9 December 2010 08​:16, Eric Brine <ikegami@​adaelis.com> wrote​:
...

The effect of "use encoding" on \x escapes in literals and the like is why
some people avoid "use encoding".

Yes, for many including me, it seems rather insane, I guess for some
it makes sense, but I really wish they had picked a different escape
to use than remapping \x{}.

Dan Kogai, the maintainer of Encode (including encoding.pm), said in his blog,
"404 Blog Not Found":

  - The encoding pragma was originally developed for the transition
  from localized perl scripts such as jperl (Japanized perl) to
  perl 5.8.
  "use encoding 'UTF-8'" is beyond the scope of its purpose.
  "use utf8" is enough to write Perl code in UTF-8.

  at 19:15, 22 June 2007
  perl - no encoding; # whenever possible
  http://blog.livedoor.jp/dankogai/archives/50857509.html
  (in Japanese)

  - "use encoding;" should be used only when you need to rewrite
  old code to run under a newer perl.

  at 14:30, 08 June 2009
  perl - use encoding; # WA KURO REKISHI
  (translation: perl - use encoding; # is a black history)
  http://blog.livedoor.jp/dankogai/archives/51221731.html
  (in Japanese)

If he intended encoding.pm mainly to make it easier to port
localized code from byte semantics to Unicode semantics,
it makes sense that encoding.pm automatically converts
not only raw literals but also escapes such as \xHH,
which were in use before 5.6.

Regards,
SADAHIRO Tomoyuki

p5pRT commented Dec 12, 2010

From BQW10602@nifty.com

On Sat, 11 Dec 2010 14​:43​:16 +0100
demerphq <demerphq@​gmail.com> wrote​:
...

Also, and much worse, at least up until 5.10 this insane
remapping of codepoints also affected the \N{U+$codepoint} syntax.

It was fixed sometime since then, as it's not in blead, but I haven't
checked when, or whether it is fixed in 5.12.

$ perl -v && perl -le'use encoding "iso 8859-7"; $a = "\xDF";
$b="\N{U+DF}"; printf "0x%04x\n", ord for $a,$b'
...

This fix seems to have happened between 5.11.4 and 5.11.5, in
  PATCH: [perl #56444] delayed interpolation of \N{...}
  http://perl5.git.perl.org/perl.git/commit/ff3f963aa0f95ea53996b6a3842b824504b57c79

which makes \N{U+XX} syntax always have Unicode semantics
and prevents the block (in toke.c:S_scan_const()):
  if (PL_encoding && !has_utf8) {
  sv_recode_to_utf8(sv, PL_encoding);
from recoding \xDF to \x{3AF} under "use encoding 'iso 8859-7'".

Perhaps the Encode maintainer had not considered
whether "\N{U+DF}" should be equivalent to "\xDF".
