Problem with \W, [\W], and locales #8252

Closed
p5pRT opened this issue Dec 21, 2005 · 8 comments

Comments

p5pRT commented Dec 21, 2005

Migrated from rt.perl.org#37998 (status was 'resolved')

Searchable as RT37998$

p5pRT commented Dec 21, 2005

From jfriedl@yahoo.com

This is a bug report for perl from jfriedl@yahoo.com,
generated with the help of perlbug 1.35 running under perl 5.9.3.

I've come across situations where \W and [\W] seem to match differently.

The test program is a bit complex because as far as I can tell, the
situation presents itself only when the data comes in via a file handle.
So, to keep things in one program, I fork a child and write to it.

The strangeness of the apparent problem (along with the need for such a
convoluted test program) makes me think strongly that it's just me doing
something stupid, but I can't for the life of me figure it out.

The test program looks at what \w [\w] \W [\W] considers a word character
in different locale situations. When considering codepoints from 0..255,
the answer is either "ASCII word characters" or "Latin1 word characters".

The program is appended.
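
The attached program is not reproduced in this thread. As a rough illustration only, a classifier of the kind described above might look like the following sketch. The helper name `classify` and its arguments are hypothetical, and `utf8::upgrade` is used here as a stand-in for data arriving through an `:encoding` layer instead of the fork-and-pipe setup the report mentions.

```perl
use strict;
use warnings;

# Hypothetical helper, not the program attached to the report.
# Returns "ASCII" when exactly the 63 ASCII word characters
# (0-9, A-Z, a-z, _) are treated as word characters, otherwise a
# label carrying the count that was actually seen.
sub classify {
    my ($re, $negated, $upgrade) = @_;
    my $count = 0;
    for my $cp (0 .. 255) {
        my $ch = chr $cp;
        # Optionally force the internal UTF-8 representation, which is
        # roughly what reading through an :encoding layer produces.
        utf8::upgrade($ch) if $upgrade;
        my $matched = ($ch =~ $re) ? 1 : 0;
        # For \W and [\W], a word character is one that does NOT match.
        $count++ if $negated ? !$matched : $matched;
    }
    return $count == 63 ? 'ASCII' : "non-ASCII ($count word characters)";
}

for my $upgrade (0, 1) {
    print $upgrade ? "UTF-8 flagged strings\n" : "Native byte strings\n";
    for my $test ( ['\w',   qr/\w/,   0], ['[\w]', qr/[\w]/, 0],
                   ['\W',   qr/\W/,   1], ['[\W]', qr/[\W]/, 1] ) {
        my ($label, $re, $negated) = @$test;
        printf "  %-5s: %s\n", $label, classify($re, $negated, $upgrade);
    }
}
```

No locale is set in this sketch; wrapping the matches in `use locale` after a `setlocale` call would be needed to reproduce the "C locale" and "Fr locale" rows.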

With the current bleedperl (patch 26415), I get the following
(with some notes I've added):

  Perl version 5.009003

  Using :encoding data-----------
  No locale  \W : Latin1    <-- different!
  No locale [\W]: ASCII     <-- different!

  No locale  \w : Latin1
  No locale [\w]: Latin1

  C locale   \W : Latin1    <-- different!
  C locale  [\W]: ASCII     <-- different!

  C locale   \w : Latin1
  C locale  [\w]: Latin1

  Fr locale  \W : Latin1    <-- different!
  Fr locale [\W]: ASCII     <-- different!

  Fr locale  \w : Latin1
  Fr locale [\w]: Latin1

  Using literal string-----------
  No locale  \W : ASCII
  No locale [\W]: ASCII

  No locale  \w : ASCII
  No locale [\w]: ASCII

  C locale   \W : ASCII
  C locale  [\W]: ASCII

  C locale   \w : ASCII
  C locale  [\w]: ASCII

  Fr locale  \W : ASCII
  Fr locale [\W]: ASCII

  Fr locale  \w : ASCII
  Fr locale [\w]: ASCII

So, regardless of the locale, \W and [\W] can match differently.

I also don't understand why in the top half, anything in the "no locale"
and "C locale" have anything to do with Latin1. Is this because once the
data gets into utf8, LC_CTYPE/locale becomes irrelevant?

In the 2nd half, where everything should be in the "native character set"
(I'm on a standard Debian Linux system with no locale defined), I'd expect
that setting the locale to fr_FR would have \w et al. recognize latin1
stuff.

I've really been banging my head against this for a while, trying to isolate
the problem enough to send it in, so I hope someone can make sense of it.

Thanks,
  Jeffrey


Jeffrey Friedl Kyoto, Japan http://regex.info/blog/


Flags:
  category=core
  severity=medium


Site configuration information for perl 5.9.3:

Configured by jfriedl at Tue Dec 20 05:44:24 PST 2005.

Summary of my perl5 (revision 5 version 9 subversion 3 patch 26415) configuration:
  Platform:
    osname=linux, osvers=2.4.19, archname=i686-linux
    uname='linux pakupaku.regex.info 2.4.19 #9 smp sat dec 28 06:19:03 pst 2002 i686 gnulinux '
    config_args='-Dusedevel -d -e -s -O -D optimize=-O2 -g'
    hint=previous, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    ccversion='', gccversion='4.0.3 20051201 (prerelease) (Debian 4.0.2-5)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.3.5.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.3.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
  DEVEL24148


@INC for perl 5.9.3:
  lib
  /home/jfriedl/lib/perl
  /usr/local/lib/perl5/5.9.3/i686-linux
  /usr/local/lib/perl5/5.9.3
  /usr/local/lib/perl5/site_perl/5.9.3/i686-linux
  /usr/local/lib/perl5/site_perl/5.9.3
  /usr/local/lib/perl5/site_perl
  .


Environment for perl 5.9.3:
  HOME=/home/jfriedl
  LANG (unset)
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=.:/home/jfriedl/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
  PERLLIB=/home/jfriedl/lib/perl
  PERL_BADLANG (unset)
  SHELL=/bin/bash

p5pRT commented Dec 21, 2005

From BQW10602@nifty.com

On Tue, 20 Dec 2005 18:11:54 -0800, Jeffrey Friedl (via RT) wrote:

> The test program looks at what \w [\w] \W [\W] considers a word character
> in different locale situations. When considering codepoints from 0..255,
> the answer is either "ASCII word characters" or "Latin1 word characters".

>   Perl version 5.009003
>
>   Using :encoding data-----------
>   No locale  \W : Latin1    <-- different!
>   No locale [\W]: ASCII     <-- different!
>
>   No locale  \w : Latin1
>   No locale [\w]: Latin1

> So, regardless of the locale, \W and [\W] can match differently.

perlunicode.pod ("Effects of Character Semantics") says:

  (However, and as a limitation of the current implementation,
  using \w or \W inside a [...] character class will still match
  with byte semantics.)

cf. Bug #18281: UTF-8 bug: latin1 character both [\w] and [\W]

Thus, even when \x80 to \xff are flagged as "utf8", all of them match [\W],
since they are non-word characters under byte semantics. :-<
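
A small check along these lines can show whether a given perl still has the discrepancy. It is only a sketch: `utf8::upgrade` is assumed to be an adequate stand-in for data that arrived already flagged as UTF-8, and a single Latin-1 letter is used as the probe.

```perl
use strict;
use warnings;

my $ch = "\xE9";       # e-acute, a Latin-1 letter
utf8::upgrade($ch);    # same character, now UTF-8 flagged internally

# Match the same character against \W bare and inside a bracketed class.
my $bare    = ($ch =~ /\W/)   ? 1 : 0;
my $bracket = ($ch =~ /[\W]/) ? 1 : 0;

printf "perl %s: \\W=%d [\\W]=%d (%s)\n",
    $], $bare, $bracket,
    ($bare == $bracket ? 'consistent' : 'inconsistent');
```

On the perl from the report the two results would be expected to disagree; on releases where the limitation has been removed they should agree.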

> I also don't understand why in the top half, anything in the "no locale"
> and "C locale" have anything to do with Latin1. Is this because once the
> data gets into utf8, LC_CTYPE/locale becomes irrelevant?

perlunicode.pod ("Locales") says:

  Usually locale settings and Unicode do not affect each other,

Even if the system's iswalnum() supports characters beyond a single octet,
Perl's Unicode support uses none of wchar_t, wint_t, or wctype_t.

> In the 2nd half, where everything should be in the "native character set"
> (I'm on a standard Debian Linux system with no locale defined), I'd expect
> that setting the locale to fr_FR would have \w et al. recognize latin1
> stuff.

Perl's "old" locale support (under byte semantics) relies on the system's
locale support. Perl itself doesn't keep definitions for locales.

perllocale.pod ("PREPARING TO USE LOCALES") says:

  If Perl applications are to understand and present your data
  correctly according a locale of your choice, all of the following
  must be true:

  - Your operating system must support the locale system.
  - Definitions for locales that you use must be installed.
  - Perl must believe that the locale system is supported.
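
One way to check the second condition from within Perl is to see whether `setlocale` accepts the locale name at all. The sketch below assumes the names `fr_FR` and `fr_FR.ISO-8859-1`, which vary between systems (`locale -a` lists what is actually installed); it is an illustration, not a diagnosis of the reporter's machine.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

my $ch = "\xE9";   # e-acute in ISO-8859-1, held as a native byte string

# The locale names below are assumptions and may not exist on your system.
for my $name ('C', 'fr_FR', 'fr_FR.ISO-8859-1') {
    my $got = setlocale(LC_CTYPE, $name);   # undef when the locale is unavailable
    unless (defined $got) {
        printf "%-17s not available on this system\n", $name;
        next;
    }
    use locale;   # lexically scoped: regexes compiled from here on are locale-aware
    printf "%-17s set; \\xE9 %s \\w\n", $name,
        ($ch =~ /\w/ ? 'matches' : 'does not match');
}

setlocale(LC_CTYPE, 'C');   # leave the process in a known state
```

If `fr_FR` is reported as unavailable, `\w` has no locale definition to consult, which would be consistent with the all-ASCII results in the second half of the report.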

Regards,
SADAHIRO Tomoyuki

p5pRT commented Dec 21, 2005

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Dec 23, 2005

From jfriedl@yahoo.com

SADAHIRO Tomoyuki via RT <perlbug-followup@perl.org> wrote:
|> > So, regardless of the locale, \W and [\W] can match differently.
|>
|> perlunicode.pod ("Effects of Character Semantics") says:
|>
|> (However, and as a limitation of the current implementation,
|> using \w or \W inside a [...] character class will still match
|> with byte semantics.)
|>
|> cf. Bug #18281: UTF-8 bug: latin1 character both [\w] and [\W]

Wow, I never imagined that it might be a known, documented issue.
Thanks for pulling that out, and sorry that I didn't do so myself.

Still, "byte semantics" doesn't mean "ASCII semantics", does it?

It seems that in the first set of tests, \w matches Latin-1 alphanumerics,
not ASCII alphanumerics, even when the locale is specifically set to "C".

In the second set of tests, \w had only ASCII semantics (again, even when
the locale is explicitly set to something like "fr_FR").

I have other tests (not included in the original report) in which \w
properly responds to the local locale, matching with either ASCII or Latin1
semantics, as appropriate to the locale I test with.

So, thanks Tomoyuki for solving the biggest mystery to me (diff between
inside and outside of classes), but I'm still confused as to when \w will
and won't honor the locale....

  Jeffrey


Jeffrey Friedl Kyoto, Japan http://regex.info/blog/

p5pRT commented Dec 25, 2005

From BQW10602@nifty.com

On Thu, 22 Dec 2005 17:41:20 -0800 (PST), jfriedl@regex.info (Jeffrey Friedl) wrote:

> Still, "byte semantics" doesn't mean "ASCII semantics", does it?

No. We have byte semantics and character semantics
  (cf. http://perldoc.perl.org/perlunicode.html#Byte-and-Character-Semantics),
  but neither ASCII semantics nor Latin1 semantics.

Byte semantics is the compatibility behavior of older perls
(for example perl 5.005_03) that lack Unicode support.

Under byte semantics, \w and \W behave differently from their ASCII
definitions when
  - a locale (other than "C") is set.
  - perl runs on an EBCDIC platform.

> It seems that in the first set of tests, \w matches Latin-1 alphanumerics,
> not ASCII alphanumerics, even when the locale is specifically set to "C".
>
> In the second set of tests, \w had only ASCII semantics (again, even when
> the locale is explicitly set to something like "fr_FR").

It seems that the first set of tests (Using :encoding data)
corresponds to character semantics and the second set of tests
(Using literal string) corresponds to byte semantics.

In both cases, if locales are set properly, \w and \W should work
differently from the case of no locale.
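
The distinction between the two sets of tests can be illustrated with one character held in both internal representations. This is a sketch assuming perl 5.14 or later, because it also uses the `/a` and `/u` regex modifiers, which did not exist when this report was filed; no locale is involved.

```perl
use strict;
use warnings;

my $bytes = "\xE9";      # e-acute stored as one native byte
my $chars = "\xE9";
utf8::upgrade($chars);   # same character, stored internally as UTF-8

# The two strings are equal as far as string comparison is concerned.
print "strings compare equal\n" if $bytes eq $chars;

# Default matching depends on the internal representation;
# /a forces ASCII rules and /u forces Unicode rules for both.
for my $s ($bytes, $chars) {
    printf "utf8 flag=%d  /\\w/=%d  /\\w/a=%d  /\\w/u=%d\n",
        (utf8::is_utf8($s) ? 1 : 0),
        ($s =~ /\w/  ? 1 : 0),
        ($s =~ /\w/a ? 1 : 0),
        ($s =~ /\w/u ? 1 : 0);
}
```

With the default rules only the upgraded copy matches `\w`, which mirrors the Latin1 versus ASCII split between the ":encoding data" and "literal string" halves of the report.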

> I have other tests (not included in the original report) in which \w
> properly responds to the local locale, matching with either ASCII or Latin1
> semantics, as appropriate to the locale I test with.
>
> So, thanks Tomoyuki for solving the biggest mystery to me (diff between
> inside and outside of classes), but I'm still confused as to when \w will
> and won't honor the locale....
>
> Jeffrey
>
> --------------------------------------------------------------------------
> Jeffrey Friedl Kyoto, Japan http://regex.info/blog/

Regards,
SADAHIRO Tomoyuki

p5pRT commented Jan 16, 2012

From @khwilliamson

This probably should have been resolved a long time ago. Some of the
bugs involving [\w] were fixed long ago, and all the issues regarding
\w and \W matching differently were finished off in 5.12.

--
Karl Williamson

p5pRT commented Jan 16, 2012

@khwilliamson - Status changed from 'open' to 'resolved'
