Problem with \W, [\W], and locales #8252

p5pRT · 2005-12-21T02:11:53Z

Migrated from rt.perl.org#37998 (status was 'resolved')

Searchable as RT37998$

p5pRT · 2005-12-21T02:11:54Z

From jfriedl@yahoo.com

This is a bug report for perl from jfriedl@yahoo.com,
generated with the help of perlbug 1.35 running under perl 5.9.3.

I've come across situations where \W and [\W] seem to match differently.

The test program is a bit complex because as far as I can tell, the
situation presents itself only when the data comes in via a file handle.
So, to keep things in one program, I fork a child and write to it.

The strangeness of the apparent problem (along with the need for such a
convoluted test program) makes me think strongly that it's just me doing
something stupid, but I can't for the life of me figure it out.

The test program looks at what \w, [\w], \W, and [\W] consider a word character
in different locale situations. When considering codepoints from 0..255,
the answer is either "ASCII word characters" or "Latin1 word characters".
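
For readers who want to try this kind of check themselves, here is a minimal sketch. It is not the appended test program: `classify` and `%is_word` are hypothetical names, and instead of a forked child and a file handle it uses `utf8::upgrade` / `utf8::downgrade` to switch the string's internal representation, on the assumption that this is the property that matters. It assumes no `use locale` and no `unicode_strings` feature is in effect.

```perl
use strict;
use warnings;

# Predicates answering "does this construct treat the character as a word
# character?". For \W and [\W] the answer is the negation of a match.
my %is_word = (
    '\w'   => sub { $_[0] =~ /\w/ },
    '[\w]' => sub { $_[0] =~ /[\w]/ },
    '\W'   => sub { $_[0] !~ /\W/ },
    '[\W]' => sub { $_[0] !~ /[\W]/ },
);

# classify() is a hypothetical helper: it counts the word characters among
# code points 0..255 and labels the result. 63 is the size of [A-Za-z0-9_].
sub classify {
    my ($name, $upgrade) = @_;
    my $count = 0;
    for my $cp (0 .. 255) {
        my $ch = chr $cp;
        # Choose the internal representation of the one-character string.
        if ($upgrade) { utf8::upgrade($ch) } else { utf8::downgrade($ch) }
        $count++ if $is_word{$name}->($ch);
    }
    return $count == 63 ? 'ASCII' : $count > 63 ? 'Latin1' : 'other';
}

for my $name ('\w', '[\w]', '\W', '[\W]') {
    for my $upgrade (0, 1) {
        printf "%-5s %-13s: %s\n",
            $name, ($upgrade ? 'UTF8-flagged' : 'bytes'), classify($name, $upgrade);
    }
}
```

Any row where a bracketed form disagrees with its bare form for the same representation would correspond to the "different!" lines in the output below.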

The program is appended.

With the current bleedperl (patch 26415), I get the following:
(with some notes I've added):

Perl version 5.009003

Using :encoding data-----------
No locale \W : Latin1 <-- different!
No locale [\W]: ASCII <-- different!

No locale \w : Latin1
No locale [\w]: Latin1

C locale \W : Latin1 <-- different!
C locale [\W]: ASCII <-- different!

C locale \w : Latin1
C locale [\w]: Latin1

Fr locale \W : Latin1 <-- different!
Fr locale [\W]: ASCII <-- different!

Fr locale \w : Latin1
Fr locale [\w]: Latin1

Using literal string-----------
No locale \W : ASCII
No locale [\W]: ASCII

No locale \w : ASCII
No locale [\w]: ASCII

C locale \W : ASCII
C locale [\W]: ASCII

C locale \w : ASCII
C locale [\w]: ASCII

Fr locale \W : ASCII
Fr locale [\W]: ASCII

Fr locale \w : ASCII
Fr locale [\w]: ASCII

So, regardless of the locale, \W and [\W] can match differently.

I also don't understand why, in the top half, the "no locale" and
"C locale" results have anything to do with Latin1. Is this because once the
data gets into utf8, LC_CTYPE/locale becomes irrelevant?

In the 2nd half, where everything should be in the "native character set"
(I'm on a standard Debian Linux system with no locale defined), I'd expect
that setting the locale to fr_FR would have \w et al. recognize latin1
stuff.

I've really been banging my head against this for a while, trying to isolate
the problem enough to send it in, so I hope someone can make sense of it.

Thanks,
Jeffrey

Jeffrey Friedl Kyoto, Japan http://regex.info/blog/

Flags:
category=core
severity=medium

Site configuration information for perl 5.9.3:

Configured by jfriedl at Tue Dec 20 05:44:24 PST 2005.

Summary of my perl5 (revision 5 version 9 subversion 3 patch 26415) configuration:
Platform:
osname=linux, osvers=2.4.19, archname=i686-linux
uname='linux pakupaku.regex.info 2.4.19 #9 smp sat dec 28 06:19:03 pst 2002 i686 gnulinux '
config_args='-Dusedevel -d -e -s -O -D optimize=-O2 -g'
hint=previous, useposix=true, d_sigaction=define
useithreads=undef, usemultiplicity=undef
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=undef, use64bitall=undef, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O2 -g',
cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
ccversion='', gccversion='4.0.3 20051201 (prerelease) (Debian 4.0.2-5)', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='cc', ldflags =' -L/usr/local/lib'
libpth=/usr/local/lib /lib /usr/lib
libs=-lnsl -ldb -ldl -lm -lcrypt -lutil -lc
perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
libc=/lib/libc-2.3.5.so, so=so, useshrplib=false, libperl=libperl.a
gnulibc_version='2.3.5'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
DEVEL24148

@INC for perl 5.9.3:
lib
/home/jfriedl/lib/perl
/usr/local/lib/perl5/5.9.3/i686-linux
/usr/local/lib/perl5/5.9.3
/usr/local/lib/perl5/site_perl/5.9.3/i686-linux
/usr/local/lib/perl5/site_perl/5.9.3
/usr/local/lib/perl5/site_perl
.

Environment for perl 5.9.3:
HOME=/home/jfriedl
LANG (unset)
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=.:/home/jfriedl/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
PERLLIB=/home/jfriedl/lib/perl
PERL_BADLANG (unset)
SHELL=/bin/bash

p5pRT · 2005-12-21T12:54:18Z

From BQW10602@nifty.com

On Tue, 20 Dec 2005 18:11:54 -0800, Jeffrey Friedl (via RT) wrote

The test program looks at what \w, [\w], \W, and [\W] consider a word character
in different locale situations. When considering codepoints from 0..255,
the answer is either "ASCII word characters" or "Latin1 word characters".

Perl version 5.009003

Using :encoding data-----------
No locale \W : Latin1 <-- different!
No locale [\W]: ASCII <-- different!

No locale \w : Latin1
No locale [\w]: Latin1

So, regardless of the locale, \W and [\W] can match differently.

perlunicode.pod ("Effects of Character Semantics") says:

(However, and as a limitation of the current implementation,
using \w or \W inside a [...] character class will still match
with byte semantics.)

cf. Bug #18281: UTF-8 bug: latin1 character both [\w] and [\W]

Thus, even when \x80 to \xff are flagged as "utf8", all of them match [\W],
since they are non-word characters under byte semantics. :-<
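
A small sketch of how one might check for this discrepancy on a given perl. `class_disagreements` is a hypothetical helper, not part of any test suite, and `utf8::upgrade` is used here to force the "utf8" flag rather than reading through an encoding layer. On a perl where the limitation quoted above has been lifted, the list would be expected to come back empty.

```perl
use strict;
use warnings;

# Return the code points in 0x80..0xFF on which \W and [\W] disagree when
# the subject string carries the internal UTF8 flag.
sub class_disagreements {
    my @diff;
    for my $cp (0x80 .. 0xFF) {
        my $ch = chr $cp;
        utf8::upgrade($ch);    # force the "utf8" internal representation
        my $bare    = $ch =~ /\W/   ? 1 : 0;
        my $bracket = $ch =~ /[\W]/ ? 1 : 0;
        push @diff, $cp if $bare != $bracket;
    }
    return @diff;
}

my @diff = class_disagreements();
printf "%d code point(s) where \\W and [\\W] disagree\n", scalar @diff;
print join(' ', map { sprintf '%02X', $_ } @diff), "\n" if @diff;
```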

I also don't understand why, in the top half, the "no locale" and
"C locale" results have anything to do with Latin1. Is this because once the
data gets into utf8, LC_CTYPE/locale becomes irrelevant?

perlunicode.pod ("Locales") says:

Usually locale settings and Unicode do not affect each other,

Even if iswalnum() from the system supports characters beyond a single octet,
Perl's Unicode support uses neither wchar_t, wint_t, nor wctype_t.

In the 2nd half, where everything should be in the "native character set"
(I'm on a standard Debian Linux system with no locale defined), I'd expect
that setting the locale to fr_FR would have \w et al. recognize latin1
stuff.

Perl's "old" locale support (under byte semantics) relies on the system's
locale support. Perl itself doesn't keep definitions for locales.

perllocale.pod ("PREPARING TO USE LOCALES") says:

If Perl applications are to understand and present your data
correctly according to a locale of your choice, all of the following
must be true:

- Your operating system must support the locale system.
- Definitions for locales that you use must be installed.
- Perl must believe that the locale system is supported.
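
The three conditions above can be probed with a short sketch. `locale_available` is a hypothetical helper and the locale names in the loop are only examples (the exact spelling of the French locale varies between systems). `POSIX::setlocale` returns undef when the requested locale cannot be selected, and `$Config{d_setlocale}` records whether perl was built with `setlocale` support.

```perl
use strict;
use warnings;
use Config;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: true if the system can switch LC_CTYPE to $name.
sub locale_available {
    my ($name) = @_;
    my $saved = setlocale(LC_CTYPE);                   # query current setting
    my $ok    = defined setlocale(LC_CTYPE, $name);    # undef means failure
    setlocale(LC_CTYPE, $saved) if defined $saved;     # restore what we had
    return $ok;
}

printf "perl built with setlocale: %s\n", $Config{d_setlocale} ? 'yes' : 'no';
for my $name ('C', 'fr_FR', 'fr_FR.ISO-8859-1') {      # example names only
    printf "%-18s %s\n", $name,
        locale_available($name) ? 'available' : 'not available';
}
```

If fr_FR shows as not available, `use locale` cannot make \w follow it, which would be one explanation for the all-ASCII results in the second half of the report.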

Regards,
SADAHIRO Tomoyuki

p5pRT · 2005-12-21T12:54:23Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2005-12-23T09:58:11Z

From jfriedl@yahoo.com

Wow, I never imagined that it might be a known, documented issue.
Thanks for pulling that out, and sorry that I didn't do so myself.

Still, "byte semantics" doesn't mean "ASCII semantics", does it?

It seems that in the first set of tests, \w matches Latin-1 alphanumerics,
not ASCII alphanumerics, even when the locale is specifically set to "C".

In the second set of tests, \w had only ASCII semantics (again, even when
the locale is explicitly set to something like "fr_FR").

I have other tests (not included in the original report) in which \w
properly responds to the local locale, matching with either ASCII or Latin1
semantics, as appropriate to the locale I test with.

So, thanks Tomoyuki for solving the biggest mystery to me (diff between
inside and outside of classes), but I'm still confused as to when \w will
and won't honor the locale....

Jeffrey

Jeffrey Friedl Kyoto, Japan http://regex.info/blog/

p5pRT · 2005-12-25T08:02:37Z

From BQW10602@nifty.com

On Thu, 22 Dec 2005 17:41:20 -0800 (PST), jfriedl@regex.info (Jeffrey Friedl) wrote

Still, "byte semantics" doesn't mean "ASCII semantics", does it?

No. We have byte semantics and character semantics,
(cf. http://perldoc.perl.org/perlunicode.html#Byte-and-Character-Semantics)
but neither ASCII semantics nor Latin1 semantics.

Byte semantics is the compatibility behavior of older perls
(for example perl 5.005_03) that lack Unicode support.

Under byte semantics, \w and \W behave differently from plain ASCII
when
- a locale (other than "C") is set.
- perl runs on EBCDIC platforms.
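
The locale case can be illustrated with a sketch like the following. `word_bytes_in_locale` is a hypothetical helper, and the locale names are examples that may not be installed on a given system, in which case the helper returns undef. It counts how many of the bytes 0..255 match \w when `use locale` is in effect.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: count the bytes 0..255 that \w matches under
# "use locale" once LC_CTYPE is switched to $name; undef if unavailable.
sub word_bytes_in_locale {
    my ($name) = @_;
    my $saved = setlocale(LC_CTYPE);
    return undef unless defined setlocale(LC_CTYPE, $name);
    my $count;
    {
        use locale;    # patterns compiled in this block follow LC_CTYPE
        $count = grep { chr($_) =~ /\w/ } 0 .. 255;
    }
    setlocale(LC_CTYPE, $saved) if defined $saved;     # restore what we had
    return $count;
}

for my $name ('C', 'fr_FR', 'fr_FR.ISO-8859-1') {      # example names only
    my $n = word_bytes_in_locale($name);
    printf "%-18s %s\n", $name, defined $n ? "$n word bytes" : 'locale not available';
}
```

A count above 63 (the size of [A-Za-z0-9_]) for a locale would indicate that \w is following that locale's definition of alphanumerics rather than ASCII.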

It seems that in the first set of tests, \w matches Latin-1 alphanumerics,
not ASCII alphanumerics, even when the locale is specifically set to "C".

In the second set of tests, \w had only ASCII semantics (again, even when
the locale is explicitly set to something like "fr_FR").

It seems that the first set of tests (Using :encoding data)
corresponds to character semantics and the second set of tests
(Using literal string) corresponds to byte semantics.
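
A minimal sketch of that distinction, assuming no `use locale` and no `unicode_strings` feature is in effect. It reads the same Latin-1 bytes once as a plain literal and once through an `:encoding` layer on an in-memory file (standing in for the forked child in the original test), then reports the internal "utf8" flag and whether \w matches the e-acute character.

```perl
use strict;
use warnings;

my $bytes = "caf\xE9\n";    # Latin-1 bytes; a plain literal has no UTF8 flag

# Read the same bytes back through an :encoding layer (in-memory file).
open my $fh, '<:encoding(iso-8859-1)', \$bytes or die "open: $!";
my $decoded = <$fh>;
close $fh;

for my $case ([literal => $bytes], [':encoding' => $decoded]) {
    my ($label, $str) = @$case;
    # /\w$/ can only match at the e-acute, the last character before "\n".
    printf "%-10s utf8 flag: %-3s  \\w matches e-acute: %s\n",
        $label,
        utf8::is_utf8($str) ? 'on' : 'off',
        $str =~ /\w$/ ? 'yes' : 'no';
}
```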

In both cases, if locales are set properly, \w and \W should work
differently from the case of no locale.

I have other tests (not included in the original report) in which \w
properly responds to the local locale, matching with either ASCII or Latin1
semantics, as appropriate to the locale I test with.

So, thanks Tomoyuki for solving the biggest mystery to me (diff between
inside and outside of classes), but I'm still confused as to when \w will
and won't honor the locale....
Jeffrey
--------------------------------------------------------------------------
Jeffrey Friedl Kyoto, Japan http://regex.info/blog/

Regards,
SADAHIRO Tomoyuki

p5pRT · 2012-01-16T18:11:12Z

From @khwilliamson

This probably should have been resolved a long time ago. Some of the
bugs involving [\w] were fixed long ago, and all the issues regarding
\w and \W matching differently were finished off in 5.12.

--
Karl Williamson

p5pRT · 2012-01-16T18:11:13Z

@khwilliamson - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Jan 16, 2012

p5pRT added the Severity Low label Oct 18, 2019
