Re: Unicode in Perl 5.6 - broken? #1918

p5pRT · 2000-05-01T17:56:03Z

Migrated from rt.perl.org#3188 (status was 'resolved')

Searchable as RT3188$

p5pRT · 2000-05-01T17:56:03Z

From @simoncozens

On Mon, May 01, 2000 at 10:30:59PM +0200, Bertilo Wennergren wrote:

I just started experimenting with the new Unicode features
of Perl 5.6 using the ActiveState Perl 5.6 in a Windows 98
environment. I found some serious problems. Either my
understanding of a few things is severely broken, or else
Perl 5.6 is.

My main problem concerns upper/lower case translation.
Things just don't behave in an understandable way, and
some things seem seriously broken.

To summarize the problems I've met I concocted the following
CGI script. The result is meant to be read with a browser
that can deal with UTF-8, and that can show the necessary
glyphs for characters from Latin-1 Supplement and Latin
Extended-A.

The code uses UTF-8 and won't make much sense if your
reader can't interpret UTF-8.

-------------------------------------------------------

#!/usr/bin/perl

use strict;
use utf8;

my ($string, $upper_string1, $upper_string2);

$string = "abc ABC åäö ÅÄÖ ĉĝĥĵŝŭ ĈĜĤĴŜŬ";

$upper_string1 = $string;
$upper_string1 =~ s/(\w)/\U$1\E/g;

$upper_string2 = $string;
$upper_string2 = uc($upper_string2);

my ($read_string, $upper_read_string1, $upper_read_string2);

# THE FILE "read_string.txt" SHOULD CONTAIN THE SAME
# LINE OF CHARACTERS AS "$string" ABOVE - ENCODED IN UTF-8.

Okay, this is likely to be one problem. Perl can't currently
read in a file encoded in UTF8. I have some patches to fix
this but it's not a very good solution; we're working towards
something a lot more general which will allow Perl to manipulate
file input and output in a variety of ways. I'm committed to
fixing this problem in the next release, and I'll be working
flat-out on it from next week.

The rest of the problems look really really weird, and I
don't have time to go into them this morning so I'm copying
this to perlbug so it gets a bug ticket.

Simon

open(READ,"read_string.txt");
$read_string = <READ>;
close READ;

$upper_read_string1 = $read_string;
$upper_read_string1 =~ s/(\w)/\U$1\E/g;

$upper_read_string2 = $read_string;
$upper_read_string2 = uc($upper_read_string2);

print qq(Content-type: text/html; charset=utf-8\n\n);

print qq(<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n);
print qq( "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n);
print qq(<html xmlns="http://www.w3.org/1999/xhtml">\n);
print qq(<head>\n);
print qq(<title>Test</title>\n);
print qq(</head>\n);
print qq(<body>\n);

print qq(STRING: $string</p\n>);
print qq(UPPERCASED STRING 1: $upper_string1\n);
print qq(UPPERCASED STRING 2: $upper_string2\n);

print qq(READ STRING: $read_string</p\n>);
print qq(UPPERCASED READ STRING 1: $upper_read_string1\n);
print qq(UPPERCASED READ STRING 2: $upper_read_string2\n);

print qq(</body>\n);
print qq(</html>\n);

-------------------------------------------------------

If you manage to get this code to work in a CGI environment,
you'll see that the uppercasing of the string turns out in
several different ways, and that it is correct only half of the
time. The "\U" method works only for the string that is read from
a file. The "uc()" method works only for the other string (declared
in the Perl code).

This is what I get in my browser:

-------------------------------------------------------

STRING: abc ABC åäö ÅÄÖ ĉĝĥĵŝŭ ĈĜĤĴŜŬ

UPPERCASED STRING 1: AFC ABC åÄ1ö 01Ä01 �1%41]l1 �01$01\01

UPPERCASED STRING 2: ABC ABC ÅÄÖ ÅÄÖ ĈĜĤĴŜŬ ĈĜĤĴŜŬ

READ STRING: abc ABC åäö ÅÄÖ ĉĝĥĵŝŭ ĈĜĤĴŜŬ

UPPERCASED READ STRING 1: ABC ABC ÅÄÖ ÅÄÖ ĈĜĤĴŜŬ ĈĜĤĴŜŬ

UPPERCASED READ STRING 2: ABC ABC åäö ÅÄÖ ĉĝĥĵŝŭ ĈĜĤĴŜŬ

-------------------------------------------------------

The second line is particularly horrible. Note that "b" is
uppercased to "F"! What is going on here?

If the line "use utf8" is removed things look a bit less broken at
first glance, but actually uppercasing of non-ASCII characters
doesn't work at all without the utf8 pragma.
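
What the pragma changes can be seen in a short sketch. It assumes Perl 5.8 or later and a source file saved as UTF-8, so it does not reproduce the 5.6.0 behaviour reported above. The variable names `$bytes` and `$chars` are illustrative only.

```perl
use strict;
use warnings;

# Without the pragma, the literal is a string of raw UTF-8 bytes.
my $bytes;
{
    no utf8;
    $bytes = "åäö";
}

# With the pragma, the same literal is a string of three characters.
my $chars;
{
    use utf8;
    $chars = "åäö";
}

print length($bytes), "\n";    # 6: two bytes per letter
print length($chars), "\n";    # 3: one character per letter

# uc() only has letters to work on in the character string; the byte
# string contains no ASCII letters, so it comes back unchanged.
my $upper_bytes = uc($bytes);
my $upper_chars = uc($chars);

# Byte strings from elsewhere can be decoded explicitly.
my $decoded = $bytes;
utf8::decode($decoded) or die "not valid UTF-8";
```

After `utf8::decode`, `$decoded` holds the same three characters as `$chars`, and `uc()` treats both alike.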

Is it me or is it perl?

My Perl version says:

This is perl, v5.6.0 built for MSWin32-x86-multi-thread
(with 1 registered patch, see perl -V for more detail)

Copyright 1987-2000, Larry Wall

Binary build 613 provided by ActiveState Tool Corp.
http://www.ActiveState.com
Built 12:36:25 Mar 24 2000

Bye!
--
"He was a modest, good-humored boy. It was Oxford that made him insufferable."

p5pRT · 2002-11-24T17:17:21Z

From @jhi

I believe this problem should've been fixed (as Simon Cozens described)
by the Perl 5.8.0 release.

Marking the ticket as 'resolved'.
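
As a minimal sketch of the file-reading half of the test script on Perl 5.8 or later, the following assumes the PerlIO `:encoding(UTF-8)` layer that shipped with the 5.8 I/O layer system. The temporary file stands in for "read_string.txt", and the shortened test string is an assumption made to keep the example self-contained.

```perl
use strict;
use warnings;
use utf8;                       # the literal below is UTF-8 in the source
use File::Temp qw(tempfile);

my $string = "abc ABC åäö ÅÄÖ";

# Write the test line as UTF-8 (stand-in for read_string.txt).
my ($out, $file) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)');
print $out "$string\n";
close $out or die "Cannot close $file: $!";

# Read it back through a decoding layer, so Perl sees characters, not bytes.
open(my $in, '<:encoding(UTF-8)', $file) or die "Cannot read $file: $!";
my $read_string = <$in>;
close $in;
chomp $read_string;

# Both uppercasing methods from the original script.
my $upper_uc = uc($read_string);
(my $upper_U = $read_string) =~ s/(\w)/\U$1\E/g;

binmode(STDOUT, ':encoding(UTF-8)');
print "uc(): $upper_uc\n";
print "\\U:   $upper_U\n";
```

With the input decoded on read, the two methods should agree for both the literal and the string read from the file.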

p5pRT · 2002-11-24T17:17:21Z

@jhi - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Nov 24, 2002

p5pRT added Severity Low distro-unknown labels Oct 18, 2019
