lc() + Latin-1 chars is failing erratically #8253

Closed
p5pRT opened this issue Dec 21, 2005 · 14 comments

@p5pRT

p5pRT commented Dec 21, 2005

Migrated from rt.perl.org#37999 (status was 'resolved')

Searchable as RT37999$

@p5pRT

p5pRT commented Dec 21, 2005

From skunk@iskunk.org

I have a script that is processing a list of words in Latin-1 encoding.
It is taking one word from each line, lowercasing it, and writing it
out.

I had found that certain accented letters in a word were not being
lowercased by lc(), even though other (ASCII) letters in the same word
were. At first I thought that an encoding issue was to blame, but after
hacking down a minimal bug case, I found the problem:

If I chomp() the string before lc()ing it, everything works fine. If I
chop() it first---even though the resulting string is identical---the case
transformation fails. (Same result if I do neither, retaining the trailing
"\n".) Also, if I don't read the input from a file, but merely place it
inline in the program, everything works (with chomp() and chop() alike).

I am attaching both the test script and input file; please review the
comments in the script. If the script dies with "aaaaaaaack!" then the bug
is present.

This bug has been reproduced with Perl 5.8.x built from development source.
Locale settings do not appear to affect it (happens with LANG=C, etc.).

--Daniel

--
NAME = Daniel Richard G. ## Remember, skunks _\|/_ meef?
EMAIL1 = skunk@​iskunk.org ## don't smell bad--- (/o|o\) /
EMAIL2 = skunk@​alum.mit.edu ## it's the people who < (^),>
WWW = http​://www.******.org/ ## annoy them that do! / \
--
(****** = site not yet online)
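
The symptom reported above (two strings that look identical yet lowercase differently) can be illustrated without the attached files. The following is a minimal sketch, not the reporter's bug.pl: the sample word and variable names are assumptions, and `utf8::upgrade` is used here only to stand in for whatever set the internal UTF-8 flag in the original script. It assumes a Perl without `use locale` and without the `unicode_strings` feature in scope.

```perl
use strict;
use warnings;

# Hypothetical sample word: U+00DC (capital U with diaeresis) plus ASCII letters.
my $bytes    = "\xDC-Wagen";   # native 8-bit string, internal UTF-8 flag off
my $upgraded = $bytes;
utf8::upgrade($upgraded);      # same characters, internal UTF-8 flag on

# The two strings compare equal, but they differ in internal representation.
my $same_text   = $bytes eq $upgraded       ? 1 : 0;
my $flag_bytes  = utf8::is_utf8($bytes)     ? 1 : 0;
my $flag_upgr   = utf8::is_utf8($upgraded)  ? 1 : 0;

# lc() only applies Unicode rules to code points above 0x7F when the flag is on.
my $lc_bytes    = lc $bytes;
my $lc_upgraded = lc $upgraded;

printf "equal=%d flag(bytes)=%d flag(upgraded)=%d\n",
    $same_text, $flag_bytes, $flag_upgr;
printf "first char after lc(bytes):    0x%02X\n", ord $lc_bytes;
printf "first char after lc(upgraded): 0x%02X\n", ord $lc_upgraded;
```

In both results the ASCII "W" is lowercased; only the treatment of the accented letter differs, which matches the behaviour described in the report.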

@p5pRT

p5pRT commented Dec 21, 2005

From skunk@iskunk.org

bug.pl

@p5pRT

p5pRT commented Dec 21, 2005

From skunk@iskunk.org

Ü-Wagen

@p5pRT

p5pRT commented Dec 21, 2005

From @rgs

Daniel Richard G. (via RT) wrote:

> If I chomp() the string before lc()ing it, everything works fine. If I
> chop() it first---even though the resulting string is identical---the case
> transformation fails. (Same result if I do neither, retaining the trailing
> "\n".) Also, if I don't read the input from a file, but merely place it
> inline in the program, everything works (with chomp() and chop() alike).

This seems to be related to encoding.
Cargo-culting the following snippet from do_chomp() to do_chop() seems to
fix it. Tests running...

--- doop.c (revision 6377)
+++ doop.c (working copy)
@@ -967,6 +967,16 @@
         if (SvREADONLY(sv))
             Perl_croak(aTHX_ PL_no_modify);
     }
+    if (PL_encoding) {
+        if (!SvUTF8(sv)) {
+            /* XXX, here sv is utf8-ized as a side-effect!
+               If encoding.pm is used properly, almost string-generating
+               operations, including literal strings, chr(), input data, etc.
+               should have been utf8-ized already, right?
+            */
+            sv_recode_to_utf8(sv, PL_encoding);
+        }
+    }
     s = SvPV(sv, len);
     if (len && !SvPOK(sv))
         s = SvPV_force(sv, len);

@p5pRT

p5pRT commented Dec 21, 2005

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

p5pRT commented Dec 21, 2005

From @rgs

Rafael Garcia-Suarez wrote:

> Daniel Richard G. (via RT) wrote:
>
>> If I chomp() the string before lc()ing it, everything works fine. If I
>> chop() it first---even though the resulting string is identical---the case
>> transformation fails. (Same result if I do neither, retaining the trailing
>> "\n".) Also, if I don't read the input from a file, but merely place it
>> inline in the program, everything works (with chomp() and chop() alike).
>
> This seems to be related to encoding.
> Cargo-culting the following snippet from do_chomp() to do_chop() seems to
> fix it. Tests running...

Now committed as #26431 (with prettier comments).

> --- doop.c (revision 6377)
> +++ doop.c (working copy)

@p5pRT

p5pRT commented Dec 21, 2005

@rgs - Status changed from 'open' to 'resolved'

@p5pRT

p5pRT commented Dec 21, 2005

From skunk@iskunk.org

I've confirmed that the bug no longer occurs when using chop(), or even
s/\n// in the same place.
 
However, if I don't modify the string (no chomp/chop/etc.), remove the
"eq" check, and lc() it with newline and all, the accented U again
stays as-is. (This is with source from perl-current.)
 
I'm not familiar with Perl's internals, but if lc() is failing due to
its argument not having been previously mirrored in a Perl-internal
UTF-8 representation... would it not make sense to have the
check-and-reencode bit at the top of lc()'s implementation (and in
other functions making use of encoding-dependent semantics), rather
than attempt to cover all possible origins of lc()'s argument?
 
(Quick question, btw​: As a workaround for my scripts, is there a
concise way of bestowing internal-UTF-8-ness on a string without
otherwise modifying it?)

@p5pRT

p5pRT commented Dec 21, 2005

skunk@iskunk.org - Status changed from 'resolved' to 'open'

@p5pRT

p5pRT commented Dec 21, 2005

From skunk@iskunk.org

I've confirmed that the bug no longer occurs when using chop(), or even
s/\n// in the same place.

However, if we don't modify the string (no chomp/chop/etc.), remove the
"eq" check, and lc() it with newline and all, the accented U again stays
as-is. (This is with source from perl-current.)

I'm not familiar with Perl's internals, but if lc() is failing due to its
argument not having been previously mirrored in a Perl-internal UTF-8
representation... would it not make sense to have the check-and-reencode
bit at the top of lc()'s implementation (and in other functions making use
of encoding-dependent semantics), rather than attempt to cover all possible
origins of lc()'s argument?

And a quick question​: As a workaround for my scripts, is there a concise
way of bestowing internal-UTF8-ness on a string without otherwise modifying
it?
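
On the workaround question: one way this might be done is with the core function `utf8::upgrade`, which changes a string's internal representation to UTF-8 without changing the characters it contains. The sketch below is an illustration only, not an answer given in the thread; the helper name `lc_unicode` and the sample word are hypothetical. It assumes the input already holds Latin-1 characters, one per byte.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): lowercase with Unicode rules
# by upgrading a private copy of the argument before calling lc().
sub lc_unicode {
    my ($text) = @_;          # copy, so the caller's string is left untouched
    utf8::upgrade($text);     # same characters, internal UTF-8 representation
    return lc $text;
}

# Sample line as it might be read from a Latin-1 file, newline included.
my $word  = "\xDC-Wagen\n";
my $lower = lc_unicode($word);

printf "first char before: 0x%02X, after: 0x%02X\n", ord $word, ord $lower;
```

Working on a copy keeps the caller's string in its original representation. Calling `utf8::upgrade($word)` directly would also work if modifying the flag on the original string is acceptable.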

@p5pRT

p5pRT commented Dec 21, 2005

From @rgarcia

On 12/21/05, Daniel Richard G. <skunk@iskunk.org> wrote:

> I've confirmed that the bug no longer occurs when using chop(), or even
> s/\n// in the same place.
>
> However, if we don't modify the string (no chomp/chop/etc.), remove the
> "eq" check, and lc() it with newline and all, the accented U again stays
> as-is. (This is with source from perl-current.)

Well, my understanding is that it's the documented behaviour if you
don't use locale. (see perldoc locale)

@p5pRT

p5pRT commented Dec 21, 2005

From skunk@iskunk.org

[rgarciasuarez@gmail.com - Wed Dec 21 13:57:15 2005]:

> Well, my understanding is that it's the documented behaviour if you
> don't use locale. (see perldoc locale)

I can add "use locale" to the test script, set LANG=LC_ALL=LC_CTYPE=C,
and the behavior is the same as before. Either lc() is wrong to
lowercase the accented-U in that instance (assuming the C locale means
it shouldn't know how to handle non-ASCII characters), or this behavior
where chop/chomp affects lc()'s result on seemingly identical input is
wrong.

(For my part, I'd prefer to be able to use "no locale" and have lc()
behave according to Unicode semantics, than have to specify a locale
that matches Unicode semantics and worry about tainting, etc.)

@p5pRT

p5pRT commented Mar 29, 2010

From @khwilliamson

On Wed Dec 21 14:49:31 2005, skunk wrote:

> [rgarciasuarez@gmail.com - Wed Dec 21 13:57:15 2005]:
>
>> Well, my understanding is that it's the documented behaviour if you
>> don't use locale. (see perldoc locale)
>
> I can add "use locale" to the test script, set LANG=LC_ALL=LC_CTYPE=C,
> and the behavior is the same as before. Either lc() is wrong to
> lowercase the accented-U in that instance (assuming the C locale means
> it shouldn't know how to handle non-ASCII characters), or this behavior
> where chop/chomp affects lc()'s result on seemingly identical input is
> wrong.
>
> (For my part, I'd prefer to be able to use "no locale" and have lc()
> behave according to Unicode semantics, than have to specify a locale
> that matches Unicode semantics and worry about tainting, etc.)

Perl 5.12 (unless glitches arise) will be released April 5, 2010. It is
adding the statement
use feature "unicode_strings";

This will cause lc() in the scope of the 'use' statement to behave as
you would hope on Latin1 characters. Therefore, I'm closing this ticket.
--Karl Williamson
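
As a rough illustration of the feature mentioned above, the sketch below lowercases the same string with and without `unicode_strings` in scope. The sample word and variable names are assumptions made for the example, and it needs Perl 5.12 or later.

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # available from Perl 5.12

# Hypothetical sample line: U+00DC plus ASCII, stored as a plain 8-bit string.
my $word = "\xDC-Wagen\n";

# With the feature in scope, lc() applies Unicode rules even though the
# string has not been upgraded to the internal UTF-8 representation.
my $with = lc $word;

# For contrast, switch the feature off in a nested lexical scope.
my $without = do {
    no feature 'unicode_strings';
    lc $word;
};

printf "with feature:    0x%02X\n", ord $with;
printf "without feature: 0x%02X\n", ord $without;
```

Because the feature is lexically scoped, it only changes `lc()` calls compiled inside the scope of the `use` statement, as the comment above notes.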

@p5pRT

p5pRT commented Mar 29, 2010

@khwilliamson - Status changed from 'open' to 'resolved'
