lc() + Latin-1 chars is failing erratically #8253

p5pRT · 2005-12-21T03:22:45Z

Migrated from rt.perl.org#37999 (status was 'resolved')

Searchable as RT37999$

p5pRT · 2005-12-21T03:22:46Z

From skunk@iskunk.org

I have a script that is processing a list of words in Latin-1 encoding.
It is taking one word from each line, lowercasing it, and writing it
out.

I had found that certain accented letters in a word were not being
lowercased by lc(), even though other (ASCII) letters in the same word
were. At first I thought that an encoding issue was to blame, but after
hacking down a minimal bug case, I found the problem:

If I chomp() the string before lc()ing it, everything works fine. If I
chop() it first---even though the resulting string is identical---the case
transformation fails. (Same result if I do neither, retaining the trailing
"\n".) Also, if I don't read the input from a file, but merely place it
inline in the program, everything works (with chomp() and chop() alike).

I am attaching both the test script and input file; please review the
comments in the script. If the script dies with "aaaaaaaack!" then the bug
is present.

This bug has been reproduced with Perl 5.8.x built from development source.
Locale settings do not appear to affect it (happens with LANG=C, etc.).
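
(Editorial illustration, not the attached bug.pl, which is not reproduced in this thread.) The symptom described above, two strings that compare equal but lowercase differently, can be sketched in plain Perl by contrasting a byte string with a copy whose internal UTF-8 flag is set. The variable names are made up for the example, and it assumes no `use locale`, no `use encoding`, and no `unicode_strings` feature in scope.

```perl
use strict;
use warnings;

# "\xDC" is U-umlaut in Latin-1; this literal is a plain byte string.
my $bytes = "\xDC-Wagen";

# A copy with Perl's internal UTF-8 flag switched on. utf8::upgrade()
# changes only the internal representation, not the characters.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# Byte string: only ASCII letters are lowercased.
my $lc_bytes = lc $bytes;
# Upgraded string: the accented letter is lowercased as well.
my $lc_upgraded = lc $upgraded;

printf "UTF-8 flag: bytes=%d upgraded=%d\n",
    utf8::is_utf8($bytes) ? 1 : 0, utf8::is_utf8($upgraded) ? 1 : 0;
printf "equal before lc(): %s\n", $bytes eq $upgraded       ? "yes" : "no";
printf "equal after lc():  %s\n", $lc_bytes eq $lc_upgraded ? "yes" : "no";
```

Whether chop() or chomp() flips that internal flag is what differed in the report; the sketch only shows why the flag matters to lc().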

--Daniel

--
NAME = Daniel Richard G. ## Remember, skunks _\|/_ meef?
EMAIL1 = skunk@iskunk.org ## don't smell bad--- (/o|o\) /
EMAIL2 = skunk@alum.mit.edu ## it's the people who < (^),>
WWW = http://www.******.org/ ## annoy them that do! / \
--
(****** = site not yet online)

p5pRT · 2005-12-21T03:22:46Z

From skunk@iskunk.org

bug.pl

p5pRT · 2005-12-21T03:22:46Z

From skunk@iskunk.org

Ü-Wagen

p5pRT · 2005-12-21T10:36:40Z

From @rgs

Daniel Richard G.(via RT) wrote:

> If I chomp() the string before lc()ing it, everything works fine. If I
> chop() it first---even though the resulting string is identical---the case
> transformation fails. (Same result if I do neither, retaining the trailing
> "\n".) Also, if I don't read the input from a file, but merely place it
> inline in the program, everything works (with chomp() and chop() alike).

This seems to be related to encoding.
Cargo-culting the following snippet from do_chomp() to do_chop() seems to
fix it. Tests running...

--- doop.c (revision 6377)
+++ doop.c (working copy)
@@ -967,6 +967,16 @@
if (SvREADONLY(sv))
Perl_croak(aTHX_ PL_no_modify);
}
+ if (PL_encoding) {
+ if (!SvUTF8(sv)) {
+ /* XXX, here sv is utf8-ized as a side-effect!
+ If encoding.pm is used properly, almost string-generating
+ operations, including literal strings, chr(), input data, etc.
+ should have been utf8-ized already, right?
+ */
+ sv_recode_to_utf8(sv, PL_encoding);
+ }
+ }
s = SvPV(sv, len);
if (len && !SvPOK(sv))
s = SvPV_force(sv, len);

p5pRT · 2005-12-21T10:36:42Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2005-12-21T11:13:41Z

From @rgs

Rafael Garcia-Suarez wrote:

> Daniel Richard G.(via RT) wrote:
>
> > If I chomp() the string before lc()ing it, everything works fine. If I
> > chop() it first---even though the resulting string is identical---the case
> > transformation fails. (Same result if I do neither, retaining the trailing
> > "\n".) Also, if I don't read the input from a file, but merely place it
> > inline in the program, everything works (with chomp() and chop() alike).
>
> This seems to be related to encoding.
> Cargo-culting the following snippet from do_chomp() to do_chop() seems to
> fix it. Tests running...

Now committed as #26431 (with prettier comments).

> --- doop.c (revision 6377)
> +++ doop.c (working copy)

p5pRT · 2005-12-21T11:13:44Z

@rgs - Status changed from 'open' to 'resolved'

p5pRT · 2005-12-21T19:14:21Z

From skunk@iskunk.org

I've confirmed that the bug no longer occurs when using chop(), or even
s/\n// in the same place.

However, if I don't modify the string (no chomp/chop/etc.), remove the
"eq" check, and lc() it with newline and all, the accented U again
stays as-is. (This is with source from perl-current.)

I'm not familiar with Perl's internals, but if lc() is failing due to
its argument not having been previously mirrored in a Perl-internal
UTF-8 representation... would it not make sense to have the
check-and-reencode bit at the top of lc()'s implementation (and in
other functions making use of encoding-dependent semantics), rather
than attempt to cover all possible origins of lc()'s argument?

(Quick question, btw: As a workaround for my scripts, is there a
concise way of bestowing internal-UTF-8-ness on a string without
otherwise modifying it?)
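
(Editorial note.) One concise candidate for this is the built-in utf8::upgrade(), which switches a string to Perl's internal UTF-8 representation without changing its characters. A minimal sketch follows, assuming the data really is Latin-1 and was read as raw bytes; `lc_latin1` is a hypothetical helper name, not an existing function.

```perl
use strict;
use warnings;

# Hypothetical helper: lowercase a Latin-1 string with character semantics.
sub lc_latin1 {
    my ($text) = @_;        # work on a copy; the caller's string is untouched
    utf8::upgrade($text);   # same characters, internal UTF-8 representation
    return lc $text;
}

my $word  = "\xDC-Wagen\n"; # U-umlaut as a Latin-1 byte, newline kept
my $lower = lc_latin1($word);

print "accented letter lowercased: ",
    ($lower eq "\xFC-wagen\n" ? "yes" : "no"), "\n";
```

For input in other encodings, decoding with the Encode module would be the more appropriate step; utf8::upgrade() only reinterprets bytes as Latin-1 characters.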

p5pRT · 2005-12-21T19:14:22Z

skunk@iskunk.org - Status changed from 'resolved' to 'open'

p5pRT · 2005-12-21T19:41:29Z

From skunk@iskunk.org

I've confirmed that the bug no longer occurs when using chop(), or even
s/\n// in the same place.

However, if we don't modify the string (no chomp/chop/etc.), remove the
"eq" check, and lc() it with newline and all, the accented U again stays
as-is. (This is with source from perl-current.)

I'm not familiar with Perl's internals, but if lc() is failing due to its
argument not having been previously mirrored in a Perl-internal UTF-8
representation... would it not make sense to have the check-and-reencode
bit at the top of lc()'s implementation (and in other functions making use
of encoding-dependent semantics), rather than attempt to cover all possible
origins of lc()'s argument?

And a quick question: As a workaround for my scripts, is there a concise
way of bestowing internal-UTF8-ness on a string without otherwise modifying
it?

p5pRT · 2005-12-21T21:57:15Z

From @rgarcia

On 12/21/05, Daniel Richard G. <skunk@iskunk.org> wrote:

> I've confirmed that the bug no longer occurs when using chop(), or even
> s/\n// in the same place.
>
> However, if we don't modify the string (no chomp/chop/etc.), remove the
> "eq" check, and lc() it with newline and all, the accented U again stays
> as-is. (This is with source from perl-current.)

Well, my understanding is that it's the documented behaviour if you
don't use locale. (see perldoc locale)

p5pRT · 2005-12-21T22:49:31Z

From skunk@iskunk.org

[rgarciasuarez@gmail.com - Wed Dec 21 13:57:15 2005]:

> Well, my understanding is that it's the documented behaviour if you
> don't use locale. (see perldoc locale)

I can add "use locale" to the test script, set LANG=LC_ALL=LC_CTYPE=C,
and the behavior is the same as before. Either lc() is wrong to
lowercase the accented-U in that instance (assuming the C locale means
it shouldn't know how to handle non-ASCII characters), or this behavior
where chop/chomp affects lc()'s result on seemingly identical input is
wrong.

(For my part, I'd prefer to be able to use "no locale" and have lc()
behave according to Unicode semantics, than have to specify a locale
that matches Unicode semantics and worry about tainting, etc.)

p5pRT · 2010-03-29T23:19:20Z

From @khwilliamson

On Wed Dec 21 14:49:31 2005, skunk wrote:

> [rgarciasuarez@gmail.com - Wed Dec 21 13:57:15 2005]:
>
> > Well, my understanding is that it's the documented behaviour if you
> > don't use locale. (see perldoc locale)
>
> I can add "use locale" to the test script, set LANG=LC_ALL=LC_CTYPE=C,
> and the behavior is the same as before. Either lc() is wrong to
> lowercase the accented-U in that instance (assuming the C locale means
> it shouldn't know how to handle non-ASCII characters), or this behavior
> where chop/chomp affects lc()'s result on seemingly identical input is
> wrong.
>
> (For my part, I'd prefer to be able to use "no locale" and have lc()
> behave according to Unicode semantics, than have to specify a locale
> that matches Unicode semantics and worry about tainting, etc.)

Perl 5.12 (unless glitches arise) will be released April 5, 2010. It is
adding the statement
use feature "unicode_strings";

This will cause lc() in the scope of the 'use' statement to behave as
you would hope on Latin1 characters. Therefore, I'm closing this ticket.
--Karl Williamson
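
(Editorial illustration.) The effect of the feature can be sketched as below, assuming Perl 5.12 or later and no locale pragma in scope; the variable names are made up for the example. The same byte string is lowercased once outside and once inside the lexical scope of the feature.

```perl
use strict;
use warnings;

my $word = "\xDC-Wagen\n";    # U-umlaut as a Latin-1 byte, newline kept

# Outside the feature's scope: only ASCII letters change.
my $without = lc $word;

my $with;
{
    # Lexically scoped: applies only inside this block.
    use feature 'unicode_strings';
    $with = lc $word;
}

print "without feature, accent lowercased: ",
    ($without eq "\xFC-wagen\n" ? "yes" : "no"), "\n";
print "with feature, accent lowercased:    ",
    ($with eq "\xFC-wagen\n" ? "yes" : "no"), "\n";
```

No chomp(), chop(), or explicit upgrade is needed inside the block, which matches the case left open earlier in this ticket.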

p5pRT · 2010-03-29T23:19:54Z

@khwilliamson - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Mar 29, 2010

p5pRT added Severity High distro-Linux type-Unicode type-core labels Oct 18, 2019
