Re: Unicode in Perl 5.6 - broken? #1918

Closed
p5pRT opened this issue May 1, 2000 · 3 comments

Comments

@p5pRT

p5pRT commented May 1, 2000

Migrated from rt.perl.org#3188 (status was 'resolved')

Searchable as RT3188$

@p5pRT

p5pRT commented May 1, 2000

From @simoncozens

On Mon, May 01, 2000 at 10:30:59PM +0200, Bertilo Wennergren wrote:

I just started experimenting with the new Unicode features
of Perl 5.6 using the ActiveState Perl 5.6 in a Windows 98
environment. I found some serious problems. Either my
understanding of a few things is severely broken, or else
Perl 5.6 is.

My main problem concerns upper/lower case translation.
Things just don't behave in an understandable way, and
some things seem seriously broken.

To summarize the problems I've met I concocted the following
CGI-script. The result is meant to be read with a browser
that can deal with UTF-8, and that can show the necessary
glyphs for characters from Latin-1 Supplement and Latin
Extended A.

The code uses UTF-8 and won't make much sense if your
reader can't interpret UTF-8.

-------------------------------------------------------

#!/usr/bin/perl

use strict;
use utf8;

my ($string, $upper_string1, $upper_string2);

$string = "abc ABC åäö ÅÄÖ �%5]m ��$4\l";

$upper_string1 = $string;
$upper_string1 =~ s/(\w)/\U$1\E/g;

$upper_string2 = $string;
$upper_string2 = uc($upper_string2);

my ($read_string, $upper_read_string1, $upper_read_string2);

# THE FILE "read_string.txt" SHOULD CONTAIN THE SAME
# LINE OF CHARACTERS AS "$string" ABOVE - ENCODED IN UTF-8.

Okay, this is likely to be one problem. Perl can't currently
read in a file encoded in UTF8. I have some patches to fix
this but it's not a very good solution; we're working towards
something a lot more general which will allow Perl to manipulate
file input and output in a variety of ways. I'm committed to
fixing this problem in the next release, and I'll be working
flat-out on it from next week.

The rest of the problems look really really weird, and I
don't have time to go into them this morning so I'm copying
this to perlbug so it gets a bug ticket.

Simon

open(READ,"read_string.txt");
$read_string = <READ>;
close READ;

$upper_read_string1 = $read_string;
$upper_read_string1 =~ s/(\w)/\U$1\E/g;

$upper_read_string2 = $read_string;
$upper_read_string2 = uc($upper_read_string2);

print qq(Content-type: text/html; charset=utf-8\n\n);

print qq(<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n);
print qq( "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n);
print qq(<html xmlns="http://www.w3.org/1999/xhtml">\n);
print qq(<head>\n);
print qq(<title>Test</title>\n);
print qq(</head>\n);
print qq(<body>\n);

print qq(<p>STRING: $string</p>\n);
print qq(<p>UPPERCASED STRING 1: $upper_string1</p>\n);
print qq(<p>UPPERCASED STRING 2: $upper_string2</p>\n);

print qq(<p>READ STRING: $read_string</p>\n);
print qq(<p>UPPERCASED READ STRING 1: $upper_read_string1</p>\n);
print qq(<p>UPPERCASED READ STRING 2: $upper_read_string2</p>\n);

print qq(</body>\n);
print qq(</html>\n);

-------------------------------------------------------

If you manage to get this code to work in a CGI environment,
you'll see that the uppercasing of the string turns out in
several different ways, and that it is correct only half of the
time. The "\U" method works only for the string that is read from
a file. The "uc()" method works only for the other string (declared
in the Perl code).

This is what I get in my browser:

-------------------------------------------------------

STRING: abc ABC åäö ÅÄÖ �%5]m ��$4\l

UPPERCASED STRING 1: AFC ABC åÄ1ö 01Ä01 �1%41]l1 �01$01\01

UPPERCASED STRING 2: ABC ABC ÅÄÖ ÅÄÖ ��$4\l ��$4\l

READ STRING: abc ABC åäö ÅÄÖ �%5]m ��$4\l

UPPERCASED READ STRING 1: ABC ABC ÅÄÖ ÅÄÖ ��$4\l ��$4\l

UPPERCASED READ STRING 2: ABC ABC åäö ÅÄÖ �%5]m ��$4\l

-------------------------------------------------------

The second line is particularly horrible. Note that "b" is
uppercased to "F"! What is going on here?

If the line "use utf8" is removed things look a bit less broken at
first glance, but actually uppercasing of non-ASCII characters
doesn't work at all without the utf8 pragma.
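
One plausible reading of this, sketched on a later Perl (5.8 or newer, with the core Encode module) rather than on the 5.6.0 build reported here: without any decoding step, Perl sees the UTF-8 form of a non-ASCII letter as separate bytes, so uc() has no single character to uppercase. The variable names below are illustrative only, and this is not a confirmed diagnosis of the 5.6.0 behaviour.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The two bytes that encode U+00E5 (a with ring above) in UTF-8.
my $bytes = "\xC3\xA5";

# Decoding turns those two bytes into a single character.
my $chars = decode('UTF-8', $bytes);

# uc() on the undecoded bytes leaves them unchanged ...
my $upper_bytes = uc $bytes;

# ... while uc() on the decoded character gives U+00C5.
my $upper_chars = uc $chars;

# Encode again before writing to a byte-oriented handle such as STDOUT.
print encode('UTF-8', $upper_chars), "\n";
```

The point of the sketch is the length difference: the byte string has length 2 and the decoded string has length 1, which is why the two forms respond differently to case mapping.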

Is it me or is it perl?

My Perl version says:

This is perl, v5.6.0 built for MSWin32-x86-multi-thread
(with 1 registered patch, see perl -V for more detail)

Copyright 1987-2000, Larry Wall

Binary build 613 provided by ActiveState Tool Corp.
http://www.ActiveState.com
Built 12:36:25 Mar 24 2000

Bye!
--
"He was a modest, good-humored boy. It was Oxford that made him insufferable."

@p5pRT

p5pRT commented Nov 24, 2002

From @jhi

I believe this problem should've been fixed (as Simon Cozens described)
by the Perl 5.8.0 release.

Marking the ticket as 'resolved'.
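
For readers checking this on Perl 5.8 or later, here is a minimal sketch of the file-reading half of the test script, assuming the PerlIO `:encoding(UTF-8)` layer and the core File::Temp module. The sample string is a stand-in (not the exact text from the report), and the temporary file replaces the original "read_string.txt" so the example is self-contained.

```perl
use strict;
use warnings;
use utf8;                          # the literal below is UTF-8 in the source
use File::Temp qw(tempfile);

# Stand-in sample text; not the exact string from the original report.
my $string = "abc ABC åäö ÅÄÖ ŝ";

# Write the sample to a temporary file as UTF-8.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode($tmp_fh, ':encoding(UTF-8)');
print {$tmp_fh} "$string\n";
close $tmp_fh or die "Cannot close $tmp_name: $!";

# Read it back through a decoding layer so Perl sees characters, not bytes.
open(my $in, '<:encoding(UTF-8)', $tmp_name)
    or die "Cannot open $tmp_name: $!";
my $read_string = <$in>;
close $in;
chomp $read_string;

# Both uppercasing methods from the test script.
(my $upper1 = $read_string) =~ s/(\w)/\U$1\E/g;
my $upper2 = uc $read_string;

# Encode on output as well.
binmode(STDOUT, ':encoding(UTF-8)');
print "$upper1\n$upper2\n";
```

With decoding on input and encoding on output, the "\U" and uc() methods should agree for both the literal and the string read from the file.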

@p5pRT

p5pRT commented Nov 24, 2002

@jhi - Status changed from 'open' to 'resolved'
