Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POSIX::mblen() broken on threaded perls with glibc #16887

Closed
p5pRT opened this issue Mar 13, 2019 · 32 comments
Closed

POSIX::mblen() broken on threaded perls with glibc #16887

p5pRT opened this issue Mar 13, 2019 · 32 comments

Comments

@p5pRT
Copy link

p5pRT commented Mar 13, 2019

Migrated from rt.perl.org#133928 (status was 'open')

Searchable as RT133928$

@p5pRT
Copy link
Author

p5pRT commented Mar 13, 2019

From @ntyni

This is a bug report for perl from ntyni@​debian.org,
generated with the help of perlbug 1.41 running under perl 5.29.9.


As reported in https://bugs.launchpad.net/bugs/1818953 POSIX​::mblen()
is broken on threaded perls with glibc.

  % perl -MPOSIX=mblen -e 'mblen("a", 1)'
  perl​: mbrtowc.c​:105​: __mbrtowc​: Assertion `__mbsinit (data.__statep)' failed.
  zsh​: abort (core dumped) perl -MPOSIX=mblen -e 'mblen("a", 1)'

This broke in v5.27.8-134-g6c9ff7e96e which made the function
use mbrlen(3) under the hood on threaded perls.

The problem is initialization of the shift state with

  mbrlen(NULL, 0, &ps));

The glibc documentation for mbrlen(3) at

  https://www.gnu.org/software/libc/manual/html_node/Converting-a-Character.html#Converting-a-Character

does not mention initialization by passing in a null pointer for the
string, only a pointer to a NUL wide character.

  If the next multibyte character corresponds to the NUL wide character,
  the return value is 0. If the next n bytes form a valid multibyte
  character, the number of bytes belonging to this multibyte character
  byte sequence is returned.

The attached proposed patch uses memset(3) instead for mbstate_t initialization,
as suggested in

  https://www.gnu.org/software/libc/manual/html_node/Keeping-the-state.html

with the hope that this is more portable.

The patch also adds a few basic test cases. These are in a new file because
they need fresh_perl_is() from test.pl while the existing ones use
Test​::More (and conversion of at least posix.t looks way too involved.)



Flags​:
  category=library
  severity=medium
  module=POSIX


Site configuration information for perl 5.29.9​:

Configured by niko at Wed Mar 13 21​:49​:26 EET 2019.

Summary of my perl5 (revision 5 version 29 subversion 9) configuration​:
  Local Commit​: aaf1159fe2b891f63f819b2d496b0f938456a36d
  Ancestor​: 823c3b2
  Platform​:
  osname=linux
  osvers=4.19.0-2-amd64
  archname=x86_64-linux-thread-multi
  uname='linux estella 4.19.0-2-amd64 #1 smp debian 4.19.16-1 (2019-01-17) x86_64 gnulinux '
  config_args='-des -Dusedevel -Dusethreads'
  hint=recommended
  useposix=true
  d_sigaction=define
  useithreads=define
  usemultiplicity=define
  use64bitint=define
  use64bitall=define
  uselongdouble=undef
  usemymalloc=n
  default_inc_excludes_dot=define
  bincompat5005=undef
  Compiler​:
  cc='cc'
  ccflags ='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2'
  optimize='-O2'
  cppflags='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
  ccversion=''
  gccversion='8.2.0'
  gccosandvers=''
  intsize=4
  longsize=8
  ptrsize=8
  doublesize=8
  byteorder=12345678
  doublekind=3
  d_longlong=define
  longlongsize=8
  d_longdbl=define
  longdblsize=16
  longdblkind=3
  ivtype='long'
  ivsize=8
  nvtype='double'
  nvsize=8
  Off_t='off_t'
  lseeksize=8
  alignbytes=8
  prototype=define
  Linker and Libraries​:
  ld='cc'
  ldflags =' -fstack-protector-strong -L/usr/local/lib'
  libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/8/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
  libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
  perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
  libc=libc-2.28.so
  so=so
  useshrplib=false
  libperl=libperl.a
  gnulibc_version='2.28'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs
  dlext=so
  d_dlsymun=undef
  ccdlflags='-Wl,-E'
  cccdlflags='-fPIC'
  lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector-strong'

Locally applied patches​:
  aaf1159fe2b891f63f819b2d496b0f938456a36d


@​INC for perl 5.29.9​:
  lib
  /usr/local/lib/perl5/site_perl/5.29.9/x86_64-linux-thread-multi
  /usr/local/lib/perl5/site_perl/5.29.9
  /usr/local/lib/perl5/5.29.9/x86_64-linux-thread-multi
  /usr/local/lib/perl5/5.29.9


Environment for perl 5.29.9​:
  HOME=/home/niko
  LANG=en_US.UTF-8
  LANGUAGE (unset)
  LC_CTYPE=fi_FI.UTF-8
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/home/niko/bin​:/home/niko/bin​:/home/niko/bin​:/usr/local/bin​:/usr/bin​:/bin​:/usr/local/games​:/usr/games​:/sbin​:/usr/sbin​:/sbin​:/usr/sbin
  PERL_BADLANG (unset)
  SHELL=/bin/zsh

@p5pRT
Copy link
Author

p5pRT commented Mar 13, 2019

From @ntyni

0001-Fix-POSIX-mblen-mbstate_t-initialization-on-threaded.patch
From aaf1159fe2b891f63f819b2d496b0f938456a36d Mon Sep 17 00:00:00 2001
From: Niko Tyni <ntyni@debian.org>
Date: Sun, 10 Mar 2019 19:40:42 +0200
Subject: [PATCH] Fix POSIX::mblen mbstate_t initialization on threaded perls
 with glibc

As reported in https://bugs.launchpad.net/bugs/1818953 POSIX::mblen()
is broken on threaded perls with glibc.

  % perl -MPOSIX=mblen -e 'mblen("a", 1)'
  perl: mbrtowc.c:105: __mbrtowc: Assertion `__mbsinit (data.__statep)' failed.
  zsh: abort (core dumped)  perl -MPOSIX=mblen -e 'mblen("a", 1)'

This broke in v5.27.8-134-g6c9ff7e96e which made the function
use mbrlen(3) under the hood on threaded perls.

The problem is initialization of the shift state with

  mbrlen(NULL, 0, &ps));

The glibc documentation for mbrlen(3) at

  https://www.gnu.org/software/libc/manual/html_node/Converting-a-Character.html#Converting-a-Character

does not mention initialization by passing in a null pointer for the
string, only a pointer to a NUL wide character.

   If the next multibyte character corresponds to the NUL wide character,
   the return value is 0. If the next n bytes form a valid multibyte
   character, the number of bytes belonging to this multibyte character
   byte sequence is returned.

Use memset(3) instead for mbstate_t initialization, as suggested in

  https://www.gnu.org/software/libc/manual/html_node/Keeping-the-state.html

with the hope that this is more portable.

While at it, add a few basic test cases. These are in a new file because
they need fresh_perl_is() from test.pl while the existing ones use
Test::More (and conversion of at least posix.t looks way too involved.)

Bug-Ubuntu: https://bugs.launchpad.net/bugs/1818953
---
 MANIFEST               |  1 +
 ext/POSIX/POSIX.xs     |  2 +-
 ext/POSIX/lib/POSIX.pm |  2 +-
 ext/POSIX/t/mb.t       | 45 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 48 insertions(+), 2 deletions(-)
 create mode 100644 ext/POSIX/t/mb.t

diff --git a/MANIFEST b/MANIFEST
index 4cf40a8eec..57465859b9 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -4182,6 +4182,7 @@ ext/POSIX/POSIX.xs		POSIX extension external subroutines
 ext/POSIX/t/export.t		Test @EXPORT and @EXPORT_OK
 ext/POSIX/t/iscrash		See if POSIX isxxx() crashes with threads on Win32
 ext/POSIX/t/math.t		Basic math tests for POSIX
+ext/POSIX/t/mb.t		Multibyte function tests for POSIX
 ext/POSIX/t/posix.t		See if POSIX works
 ext/POSIX/t/sigaction.t		See if POSIX::sigaction works
 ext/POSIX/t/sigset.t		See if POSIX::SigSet works
diff --git a/ext/POSIX/POSIX.xs b/ext/POSIX/POSIX.xs
index 1ebc358af4..27051c1fdb 100644
--- a/ext/POSIX/POSIX.xs
+++ b/ext/POSIX/POSIX.xs
@@ -3329,7 +3329,7 @@ mblen(s, n)
 #endif
     CODE:
 #if defined(USE_ITHREADS) && defined(HAS_MBRLEN)
-        PERL_UNUSED_RESULT(mbrlen(NULL, 0, &ps));   /* Initialize state */
+        memset(&ps, 0, sizeof(ps)); /* Initialize state */
         RETVAL = mbrlen(s, n, &ps); /* Prefer reentrant version */
 #else
         RETVAL = mblen(s, n);
diff --git a/ext/POSIX/lib/POSIX.pm b/ext/POSIX/lib/POSIX.pm
index 25392be4b1..4de039410f 100644
--- a/ext/POSIX/lib/POSIX.pm
+++ b/ext/POSIX/lib/POSIX.pm
@@ -4,7 +4,7 @@ use warnings;
 
 our ($AUTOLOAD, %SIGRT);
 
-our $VERSION = '1.87';
+our $VERSION = '1.88';
 
 require XSLoader;
 
diff --git a/ext/POSIX/t/mb.t b/ext/POSIX/t/mb.t
new file mode 100644
index 0000000000..4c60c22bae
--- /dev/null
+++ b/ext/POSIX/t/mb.t
@@ -0,0 +1,45 @@
+#!./perl
+
+# These tests are in a separate file, because they use fresh_perl_is()
+# from test.pl.
+
+# The mb* functions use the "underlying locale" that is not affected by
+# the Perl one.  So we run the tests in a separate "fresh_perl" process
+# with the correct LC_CTYPE set in the environment.
+
+BEGIN {
+    require Config; import Config;
+    if ($^O ne 'VMS' and $Config{'extensions'} !~ /\bPOSIX\b/) {
+	print "1..0\n";
+	exit 0;
+    }
+    unshift @INC, "../../t";
+    require 'loc_tools.pl';
+    require 'test.pl';
+}
+
+plan tests => 3;
+
+use POSIX qw();
+
+SKIP: {
+    skip("mblen() not present", 3) unless $Config{d_mblen};
+
+    is(&POSIX::mblen("a", &POSIX::MB_CUR_MAX), 1, 'mblen() basically works');
+
+    skip("LC_CTYPE locale support not available", 2)
+      unless locales_enabled('LC_CTYPE');
+
+    my $utf8_locale = find_utf8_ctype_locale();
+    skip("no utf8 locale available", 2) unless $utf8_locale;
+
+    local $ENV{LC_CTYPE} = $utf8_locale;
+
+    fresh_perl_is(
+      'use POSIX; print &POSIX::mblen("\x{c3}\x{28}", &POSIX::MB_CUR_MAX)',
+      -1, {}, 'mblen() recognizes invalid multibyte characters');
+
+    fresh_perl_is(
+     'use POSIX; print &POSIX::mblen("\N{GREEK SMALL LETTER SIGMA}", &POSIX::MB_CUR_MAX)',
+     2, {}, 'mblen() works on UTF-8 characters');
+}
-- 
2.20.1

@p5pRT
Copy link
Author

p5pRT commented Mar 13, 2019

From @jkeenan

On Wed, 13 Mar 2019 20​:49​:30 GMT, ntyni@​debian.org wrote​:

This is a bug report for perl from ntyni@​debian.org,
generated with the help of perlbug 1.41 running under perl 5.29.9.

-----------------------------------------------------------------

As reported in https://bugs.launchpad.net/bugs/1818953 POSIX​::mblen()
is broken on threaded perls with glibc.

% perl -MPOSIX=mblen -e 'mblen("a", 1)'
perl​: mbrtowc.c​:105​: __mbrtowc​: Assertion `__mbsinit (data.__statep)'
failed.
zsh​: abort (core dumped) perl -MPOSIX=mblen -e 'mblen("a", 1)'

This broke in v5.27.8-134-g6c9ff7e96e which made the function
use mbrlen(3) under the hood on threaded perls.

The problem is initialization of the shift state with

mbrlen(NULL, 0, &ps));

The glibc documentation for mbrlen(3) at

https://www.gnu.org/software/libc/manual/html_node/Converting-a-
Character.html#Converting-a-Character

does not mention initialization by passing in a null pointer for the
string, only a pointer to a NUL wide character.

If the next multibyte character corresponds to the NUL wide character,
the return value is 0. If the next n bytes form a valid multibyte
character, the number of bytes belonging to this multibyte character
byte sequence is returned.

The attached proposed patch uses memset(3) instead for mbstate_t
initialization,
as suggested in

https://www.gnu.org/software/libc/manual/html_node/Keeping-the-
state.html

with the hope that this is more portable.

The patch also adds a few basic test cases. These are in a new file
because
they need fresh_perl_is() from test.pl while the existing ones use
Test​::More (and conversion of at least posix.t looks way too
involved.)

Made available for smoke-testing in this branch​:

smoke-me/jkeenan/ntyni/133928-posix

Thank you very much.

--
James E Keenan (jkeenan@​cpan.org)

@p5pRT
Copy link
Author

p5pRT commented Mar 13, 2019

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Mar 14, 2019

From @xenu

On Wed, 13 Mar 2019 13​:49​:31 -0700
"Niko Tyni \(via RT\)" <perlbug-followup@​perl.org> wrote​:

As reported in https://bugs.launchpad.net/bugs/1818953 POSIX​::mblen()
is broken on threaded perls with glibc.

% perl -MPOSIX=mblen -e 'mblen("a", 1)'
perl​: mbrtowc.c​:105​: __mbrtowc​: Assertion `__mbsinit (data.__statep)' failed.
zsh​: abort (core dumped) perl -MPOSIX=mblen -e 'mblen("a", 1)'

This broke in v5.27.8-134-g6c9ff7e96e which made the function
use mbrlen(3) under the hood on threaded perls.

The problem is initialization of the shift state with

mbrlen(NULL, 0, &ps));

The glibc documentation for mbrlen(3) at

https://www.gnu.org/software/libc/manual/html_node/Converting-a-Character.html#Converting-a-Character

does not mention initialization by passing in a null pointer for the
string, only a pointer to a NUL wide character.

If the next multibyte character corresponds to the NUL wide character,
the return value is 0. If the next n bytes form a valid multibyte
character, the number of bytes belonging to this multibyte character
byte sequence is returned.

The attached proposed patch uses memset(3) instead for mbstate_t initialization,
as suggested in

https://www.gnu.org/software/libc/manual/html_node/Keeping-the-state.html

with the hope that this is more portable.

The patch also adds a few basic test cases. These are in a new file because
they need fresh_perl_is() from test.pl while the existing ones use
Test​::More (and conversion of at least posix.t looks way too involved.)

BTW, it seems that we are *not* resetting the shift state when mblen()
is being used. It means that if you're using some exotic east-asian locale
with encoding like ISO-2022-JP, the user-visible behaviour of
POSIX​::mblen() changes depending whether it was implemented using mblen()
or mbrlen().

ISO-2022-* encodings have special escape sequences which change the
meaning of following characters. For example, in ISO-2022-JP "\e(B"
means "interpret the following bytes as ASCII", while "\e\$@​" changes
the encoding to JIS X 0208-1978.

The state changes caused by those escape sequences are global and
preserved between mblen() calls. The state can be reset using mblen(NULL,
0). That stupid misfeature of ISO-2022-* encodings is the sole reason
why mb(r)len API is so insane.

It's mostly a theoretical problem and it's practically untestable,
because, while ISO-2022-* encodings are pretty widespread, they were
(almost?) never used for system locales.

@p5pRT
Copy link
Author

p5pRT commented Mar 14, 2019

From @jkeenan

On Wed, 13 Mar 2019 21​:27​:49 GMT, jkeenan wrote​:

On Wed, 13 Mar 2019 20​:49​:30 GMT, ntyni@​debian.org wrote​:

This is a bug report for perl from ntyni@​debian.org,
generated with the help of perlbug 1.41 running under perl 5.29.9.

[snip]

Made available for smoke-testing in this branch​:

smoke-me/jkeenan/ntyni/133928-posix

Niko, we're getting lots of smoke-test failures in this branch. See http​://perl.develop-help.com/?b=smoke-me%2Fjkeenan%2Fntyni%2F133928-posix

Specifically, failing on Linux when using g++. Failing on NetBSD when using either gcc or g++.

Can you investigate?

Thank you very much.

--
James E Keenan (jkeenan@​cpan.org)

@p5pRT
Copy link
Author

p5pRT commented Mar 16, 2019

From @ntyni

On Thu, Mar 14, 2019 at 05​:07​:27AM -0700, James E Keenan via RT wrote​:

On Wed, 13 Mar 2019 21​:27​:49 GMT, jkeenan wrote​:

Made available for smoke-testing in this branch​:

smoke-me/jkeenan/ntyni/133928-posix

Niko, we're getting lots of smoke-test failures in this branch. See http​://perl.develop-help.com/?b=smoke-me%2Fjkeenan%2Fntyni%2F133928-posix

Specifically, failing on Linux when using g++. Failing on NetBSD when using either gcc or g++.

Seems to be failing on pretty much all BSD variants.

However, I just tried and it works fine for me on FreeBSD 12. And also
on Linux with g++.

The best I can think of is that all the failing ones have set LC_ALL=C
or something like that, which seems a bit improbable. Nevertheless,
here's an updated patch that shouldn't break in the presence of LC_ALL.

Could we please see if this fares any better?

Thanks,
--
Niko

@p5pRT
Copy link
Author

p5pRT commented Mar 16, 2019

From @ntyni

0001-Fix-POSIX-mblen-mbstate_t-initialization-on-threaded.patch
From 0c09c63e43008a8ec56d8b9fc428d694fb1e2ac4 Mon Sep 17 00:00:00 2001
From: Niko Tyni <ntyni@debian.org>
Date: Sun, 10 Mar 2019 19:40:42 +0200
Subject: [PATCH] Fix POSIX::mblen mbstate_t initialization on threaded perls
 with glibc

As reported in https://bugs.launchpad.net/bugs/1818953 POSIX::mblen()
is broken on threaded perls with glibc.

  % perl -MPOSIX=mblen -e 'mblen("a", 1)'
  perl: mbrtowc.c:105: __mbrtowc: Assertion `__mbsinit (data.__statep)' failed.
  zsh: abort (core dumped)  perl -MPOSIX=mblen -e 'mblen("a", 1)'

This broke in v5.27.8-134-g6c9ff7e96e which made the function
use mbrlen(3) under the hood on threaded perls.

The problem is initialization of the shift state with

  mbrlen(NULL, 0, &ps));

The glibc documentation for mbrlen(3) at

  https://www.gnu.org/software/libc/manual/html_node/Converting-a-Character.html#Converting-a-Character

does not mention initialization by passing in a null pointer for the
string, only a pointer to a NUL wide character.

   If the next multibyte character corresponds to the NUL wide character,
   the return value is 0. If the next n bytes form a valid multibyte
   character, the number of bytes belonging to this multibyte character
   byte sequence is returned.

Use memset(3) instead for mbstate_t initialization, as suggested in

  https://www.gnu.org/software/libc/manual/html_node/Keeping-the-state.html

with the hope that this is more portable.

While at it, add a few basic test cases. These are in a new file because
they need fresh_perl_is() from test.pl while the existing ones use
Test::More (and conversion of at least posix.t looks way too involved.)

Bug-Ubuntu: https://bugs.launchpad.net/bugs/1818953
---
 MANIFEST               |  1 +
 ext/POSIX/POSIX.xs     |  2 +-
 ext/POSIX/lib/POSIX.pm |  2 +-
 ext/POSIX/t/mb.t       | 47 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 50 insertions(+), 2 deletions(-)
 create mode 100644 ext/POSIX/t/mb.t

diff --git a/MANIFEST b/MANIFEST
index 4cf40a8eec..57465859b9 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -4182,6 +4182,7 @@ ext/POSIX/POSIX.xs		POSIX extension external subroutines
 ext/POSIX/t/export.t		Test @EXPORT and @EXPORT_OK
 ext/POSIX/t/iscrash		See if POSIX isxxx() crashes with threads on Win32
 ext/POSIX/t/math.t		Basic math tests for POSIX
+ext/POSIX/t/mb.t		Multibyte function tests for POSIX
 ext/POSIX/t/posix.t		See if POSIX works
 ext/POSIX/t/sigaction.t		See if POSIX::sigaction works
 ext/POSIX/t/sigset.t		See if POSIX::SigSet works
diff --git a/ext/POSIX/POSIX.xs b/ext/POSIX/POSIX.xs
index 1ebc358af4..27051c1fdb 100644
--- a/ext/POSIX/POSIX.xs
+++ b/ext/POSIX/POSIX.xs
@@ -3329,7 +3329,7 @@ mblen(s, n)
 #endif
     CODE:
 #if defined(USE_ITHREADS) && defined(HAS_MBRLEN)
-        PERL_UNUSED_RESULT(mbrlen(NULL, 0, &ps));   /* Initialize state */
+        memset(&ps, 0, sizeof(ps)); /* Initialize state */
         RETVAL = mbrlen(s, n, &ps); /* Prefer reentrant version */
 #else
         RETVAL = mblen(s, n);
diff --git a/ext/POSIX/lib/POSIX.pm b/ext/POSIX/lib/POSIX.pm
index 25392be4b1..4de039410f 100644
--- a/ext/POSIX/lib/POSIX.pm
+++ b/ext/POSIX/lib/POSIX.pm
@@ -4,7 +4,7 @@ use warnings;
 
 our ($AUTOLOAD, %SIGRT);
 
-our $VERSION = '1.87';
+our $VERSION = '1.88';
 
 require XSLoader;
 
diff --git a/ext/POSIX/t/mb.t b/ext/POSIX/t/mb.t
new file mode 100644
index 0000000000..961edf6cf2
--- /dev/null
+++ b/ext/POSIX/t/mb.t
@@ -0,0 +1,47 @@
+#!./perl
+
+# These tests are in a separate file, because they use fresh_perl_is()
+# from test.pl.
+
+# The mb* functions use the "underlying locale" that is not affected by
+# the Perl one.  So we run the tests in a separate "fresh_perl" process
+# with the correct LC_CTYPE set in the environment.
+
+BEGIN {
+    require Config; import Config;
+    if ($^O ne 'VMS' and $Config{'extensions'} !~ /\bPOSIX\b/) {
+	print "1..0\n";
+	exit 0;
+    }
+    unshift @INC, "../../t";
+    require 'loc_tools.pl';
+    require 'test.pl';
+}
+
+plan tests => 3;
+
+use POSIX qw();
+
+SKIP: {
+    skip("mblen() not present", 3) unless $Config{d_mblen};
+
+    is(&POSIX::mblen("a", &POSIX::MB_CUR_MAX), 1, 'mblen() basically works');
+
+    skip("LC_CTYPE locale support not available", 2)
+      unless locales_enabled('LC_CTYPE');
+
+    my $utf8_locale = find_utf8_ctype_locale();
+    skip("no utf8 locale available", 2) unless $utf8_locale;
+
+    local $ENV{LC_CTYPE} = $utf8_locale;
+    local $ENV{LC_ALL};
+    delete $ENV{LC_ALL};
+
+    fresh_perl_is(
+      'use POSIX; print &POSIX::mblen("\x{c3}\x{28}", &POSIX::MB_CUR_MAX)',
+      -1, {}, 'mblen() recognizes invalid multibyte characters');
+
+    fresh_perl_is(
+     'use POSIX; print &POSIX::mblen("\N{GREEK SMALL LETTER SIGMA}", &POSIX::MB_CUR_MAX)',
+     2, {}, 'mblen() works on UTF-8 characters');
+}
-- 
2.20.1

@p5pRT
Copy link
Author

p5pRT commented Mar 16, 2019

From @jkeenan

On Sat, 16 Mar 2019 10​:19​:33 GMT, ntyni@​debian.org wrote​:

On Thu, Mar 14, 2019 at 05​:07​:27AM -0700, James E Keenan via RT wrote​:

On Wed, 13 Mar 2019 21​:27​:49 GMT, jkeenan wrote​:

Made available for smoke-testing in this branch​:

smoke-me/jkeenan/ntyni/133928-posix

Niko, we're getting lots of smoke-test failures in this branch. See
http​://perl.develop-help.com/?b=smoke-me%2Fjkeenan%2Fntyni%2F133928-
posix

Specifically, failing on Linux when using g++. Failing on NetBSD
when using either gcc or g++.

Seems to be failing on pretty much all BSD variants.

However, I just tried and it works fine for me on FreeBSD 12. And also
on Linux with g++.

The best I can think of is that all the failing ones have set LC_ALL=C
or something like that, which seems a bit improbable. Nevertheless,
here's an updated patch that shouldn't break in the presence of
LC_ALL.

Could we please see if this fares any better?

New branch for smoke-testing​:

smoke-me/jkeenan/ntyni/133928-posix-2nd

--
James E Keenan (jkeenan@​cpan.org)

@p5pRT
Copy link
Author

p5pRT commented Mar 17, 2019

From @ntyni

On Sat, Mar 16, 2019 at 06​:08​:05AM -0700, James E Keenan via RT wrote​:

On Sat, 16 Mar 2019 10​:19​:33 GMT, ntyni@​debian.org wrote​:

The best I can think of is that all the failing ones have set LC_ALL=C
or something like that, which seems a bit improbable. Nevertheless,
here's an updated patch that shouldn't break in the presence of
LC_ALL.

Could we please see if this fares any better?

New branch for smoke-testing​:

smoke-me/jkeenan/ntyni/133928-posix-2nd

Thanks. This looks much better. The only relevant failure left
AFAICS is on OpenBSD.

I looked a bit into this and I'm sort of baffled. I'd appreciate
some eyeballs on this.

openbsd% locale
LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_MESSAGES="C"
LC_ALL=

Without -Dusethreads, where my patch doesn't make a difference, the
behaviour looks correct to me (giving 2 for the test case below in a
UTF-8 locale).

% LC_CTYPE=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print mblen("\303\244", 5)'
2
% LANG=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print mblen("\303\244", 5)'
2

However, with -Dusethreads, it seems that LC_CTYPE is not effective
while LANG is. Without my patch​:

% LC_CTYPE=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print mblen("\303\244", 5)'
1
% LANG=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print mblen("\303\244", 5)'
-1

With -Dusethreads, with my patch​:

% LC_CTYPE=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print mblen("\303\244", 5)'
1
% LANG=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print mblen("\303\244", 5)'
2

Why isn't LC_CTYPE working here?
--
Niko

@p5pRT
Copy link
Author

p5pRT commented Mar 18, 2019

From @khwilliamson

On 3/17/19 1​:55 PM, Niko Tyni wrote​:

On Sat, Mar 16, 2019 at 06​:08​:05AM -0700, James E Keenan via RT wrote​:

On Sat, 16 Mar 2019 10​:19​:33 GMT, ntyni@​debian.org wrote​:

The best I can think of is that all the failing ones have set LC_ALL=C
or something like that, which seems a bit improbable. Nevertheless,
here's an updated patch that shouldn't break in the presence of
LC_ALL.

Could we please see if this fares any better?

New branch for smoke-testing​:

smoke-me/jkeenan/ntyni/133928-posix-2nd

Thanks. This looks much better. The only relevant failure left
AFAICS is on OpenBSD.

I looked a bit into this and I'm sort of baffled. I'd appreciate
some eyeballs on this.

openbsd% locale
LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_MESSAGES="C"
LC_ALL=

Without -Dusethreads, where my patch doesn't make a difference, the
behaviour looks correct to me (giving 2 for the test case below in a
UTF-8 locale).

% LC_CTYPE=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print mblen("\303\244", 5)'
2
% LANG=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print mblen("\303\244", 5)'
2

However, with -Dusethreads, it seems that LC_CTYPE is not effective
while LANG is. Without my patch​:

% LC_CTYPE=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print mblen("\303\244", 5)'
1
% LANG=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print mblen("\303\244", 5)'
-1

With -Dusethreads, with my patch​:

% LC_CTYPE=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print mblen("\303\244", 5)'
1
% LANG=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print mblen("\303\244", 5)'
2

Why isn't LC_CTYPE working here?

Use the -DLv flag on a DEBUGGING build to get locale handling trace.
Also set PERL_DEBUG_LOCALE_INIT=1 to be effective during the execution.

@p5pRT
Copy link
Author

p5pRT commented Mar 24, 2019

From @ntyni

On Sun, Mar 17, 2019 at 08​:09​:55PM -0700, karl williamson via RT wrote​:

On 3/17/19 1​:55 PM, Niko Tyni wrote​:

Thanks. This looks much better. The only relevant failure left
AFAICS is on OpenBSD.

With -Dusethreads, with my patch​:

% LC_CTYPE=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print mblen("\303\244", 5)'
1
% LANG=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print mblen("\303\244", 5)'
2

Why isn't LC_CTYPE working here?

Use the -DLv flag on a DEBUGGING build to get locale handling trace.
Also set PERL_DEBUG_LOCALE_INIT=1 to be effective during the execution.

Thanks. After glaring at this for a bit, I suspect the difference is
that on OpenBSD, newlocale(3) ignores the "oldloc" argument.

From https://man.openbsd.org/uselocale.3 :

  On OpenBSD, mask is only meaningful if it includes LC_CTYPE_MASK,
  and locname is only meaningful if it is “C” or “POSIX”, if
  it ends with “.UTF-8”, or if it is an empty string; otherwise,
  newlocale() always returns the C locale.

  On OpenBSD, newlocale() ignores oldloc, and passing (locale_t)0
  is recommended.

I'm attaching a log with PERL_DEBUG_LOCALE_INIT=1; unfortunately most
of the DEBUG_L() calls in locale.c are ineffective at locale init time
because the command line options haven't been parsed yet.

The immediate reason for treating the locale as non-UTF8 is that
Perl__is_cur_LC_category_utf8() rules out UTF8-ness because MB_CUR_MAX
is 1. I think (but am not quite sure) that the following standalone
program depicts the same behaviour. In any case, it gives a different
result for me on Linux (6) than on OpenBSD (1).

#include <stdlib.h>
#include <stdio.h>
#include <locale.h>

int main(void) {
  locale_t l, l2;
  l = newlocale(LC_CTYPE_MASK, "fi_FI.UTF-8", (locale_t) 0);
  uselocale(l);
  l2 = newlocale(LC_COLLATE_MASK, "C", l);
  uselocale(l2);
  printf("%d\n", MB_CUR_MAX);
}

Hope this makes sense. Any further insight would be welcome :)
--
Niko

@p5pRT
Copy link
Author

p5pRT commented Mar 24, 2019

From @ntyni

openbsd% env -i LC_CTYPE=fi_FI.UTF-8 PERL_DEBUG_LOCALE_INIT=1 ./perl -DLv -Ilib -MPOSIX=mblen -le 'print mblen("\303\244", 5)'
locale.c​:3362​: created C object 1
locale.c​:514​: emulate_setlocale input=0 (LC_ALL), "C", 6, 1
locale.c​:562​: category name is LC_ALL; mask is 0x7e
locale.c​:982​: emulate_setlocale was using ffffffffffffffff
locale.c​:1005​: emulate_setlocale now using 1
locale.c​:1051​: emulate_setlocale created 1; should have freed 0
locale.c​:1087​: emulate_setlocale now using 1
locale.c​:514​: emulate_setlocale input=0 (LC_ALL), "", 6, 1
locale.c​:562​: category name is LC_ALL; mask is 0x7e
locale.c​:514​: emulate_setlocale input=4 (LC_NUMERIC), "C", 0, 1
locale.c​:562​: category name is LC_NUMERIC; mask is 0x10
locale.c​:982​: emulate_setlocale was using 1
locale.c​:1005​: emulate_setlocale now using 1
locale.c​:1051​: emulate_setlocale created 1; should have freed 0
locale.c​:1087​: emulate_setlocale now using 1
locale.c​:514​: emulate_setlocale input=2 (LC_CTYPE), "fi_FI.UTF-8", 1, 1
locale.c​:562​: category name is LC_CTYPE; mask is 0x4
locale.c​:982​: emulate_setlocale was using 1
locale.c​:1005​: emulate_setlocale now using 1
locale.c​:1051​: emulate_setlocale created 2; should have freed 0
locale.c​:1087​: emulate_setlocale now using 2
locale.c​:514​: emulate_setlocale input=1 (LC_COLLATE), "C", 2, 1
locale.c​:562​: category name is LC_COLLATE; mask is 0x2
locale.c​:982​: emulate_setlocale was using 2
locale.c​:1005​: emulate_setlocale now using 1
locale.c​:1051​: emulate_setlocale created 1; should have freed 2
locale.c​:1087​: emulate_setlocale now using 1
locale.c​:514​: emulate_setlocale input=5 (LC_TIME), "C", 3, 1
locale.c​:562​: category name is LC_TIME; mask is 0x20
locale.c​:982​: emulate_setlocale was using 1
locale.c​:1005​: emulate_setlocale now using 1
locale.c​:1051​: emulate_setlocale created 1; should have freed 0
locale.c​:1087​: emulate_setlocale now using 1
locale.c​:514​: emulate_setlocale input=6 (LC_MESSAGES), "C", 4, 1
locale.c​:562​: category name is LC_MESSAGES; mask is 0x40
locale.c​:982​: emulate_setlocale was using 1
locale.c​:1005​: emulate_setlocale now using 1
locale.c​:1051​: emulate_setlocale created 1; should have freed 0
locale.c​:1087​: emulate_setlocale now using 1
locale.c​:514​: emulate_setlocale input=3 (LC_MONETARY), "C", 5, 1
locale.c​:562​: category name is LC_MONETARY; mask is 0x8
locale.c​:982​: emulate_setlocale was using 1
locale.c​:1005​: emulate_setlocale now using 1
locale.c​:1051​: emulate_setlocale created 1; should have freed 0
locale.c​:1087​: emulate_setlocale now using 1
locale.c​:514​: emulate_setlocale input=0 (LC_ALL), "(null)", 6, 1
locale.c​:562​: category name is LC_ALL; mask is 0x7e
locale.c​:574​: emulate_setlocale querying 1
locale.c​:677​: emulate_setlocale i=0, name=LC_NUMERIC, locale=C
locale.c​:677​: emulate_setlocale i=1, name=LC_CTYPE, locale=fi_FI.UTF-8
locale.c​:677​: emulate_setlocale i=2, name=LC_COLLATE, locale=C
locale.c​:677​: emulate_setlocale i=3, name=LC_TIME, locale=C
locale.c​:677​: emulate_setlocale i=4, name=LC_MESSAGES, locale=C
locale.c​:677​: emulate_setlocale i=5, name=LC_MONETARY, locale=C
locale.c​:710​: emulate_setlocale i=0, name=LC_NUMERIC, locale=C
locale.c​:710​: emulate_setlocale i=1, name=LC_CTYPE, locale=fi_FI.UTF-8
locale.c​:710​: emulate_setlocale i=2, name=LC_COLLATE, locale=C
locale.c​:710​: emulate_setlocale i=3, name=LC_TIME, locale=C
locale.c​:710​: emulate_setlocale i=4, name=LC_MESSAGES, locale=C
locale.c​:710​: emulate_setlocale i=5, name=LC_MONETARY, locale=C
locale.c​:724​: emulate_setlocale returning LC_NUMERIC=C;LC_CTYPE=fi_FI.UTF-8;LC_COLLATE=C;LC_TIME=C;LC_MESSAGES=C;LC_MONETARY=C;
locale.c​:3471​: setlocale(LC_ALL, "") returned "LC_NUMERIC=C;LC_CTYPE=fi_FI.UTF-8;LC_COLLATE=C;LC_TIME=C;LC_MESSAGES=C;LC_MONETARY=C;"
locale.c​:514​: emulate_setlocale input=4 (LC_NUMERIC), "(null)", 0, 0
locale.c​:526​: finding index of category 4 (LC_NUMERIC)
locale.c​:550​: index is 0 for LC_NUMERIC
locale.c​:562​: category name is LC_NUMERIC; mask is 0x10
locale.c​:574​: emulate_setlocale querying 1
locale.c​:640​: emulate_setlocale returning C
locale.c​:3496​: setlocale(LC_NUMERIC, NULL) returned "C"
locale.c​:514​: emulate_setlocale input=2 (LC_CTYPE), "(null)", 0, 0
locale.c​:526​: finding index of category 2 (LC_CTYPE)
locale.c​:550​: index is 1 for LC_CTYPE
locale.c​:562​: category name is LC_CTYPE; mask is 0x4
locale.c​:574​: emulate_setlocale querying 1
locale.c​:640​: emulate_setlocale returning fi_FI.UTF-8
locale.c​:3496​: setlocale(LC_CTYPE, NULL) returned "fi_FI.UTF-8"
locale.c​:514​: emulate_setlocale input=1 (LC_COLLATE), "(null)", 0, 0
locale.c​:526​: finding index of category 1 (LC_COLLATE)
locale.c​:550​: index is 2 for LC_COLLATE
locale.c​:562​: category name is LC_COLLATE; mask is 0x2
locale.c​:574​: emulate_setlocale querying 1
locale.c​:640​: emulate_setlocale returning C
locale.c​:3496​: setlocale(LC_COLLATE, NULL) returned "C"
locale.c​:514​: emulate_setlocale input=5 (LC_TIME), "(null)", 0, 0
locale.c​:526​: finding index of category 5 (LC_TIME)
locale.c​:550​: index is 3 for LC_TIME
locale.c​:562​: category name is LC_TIME; mask is 0x20
locale.c​:574​: emulate_setlocale querying 1
locale.c​:640​: emulate_setlocale returning C
locale.c​:3496​: setlocale(LC_TIME, NULL) returned "C"
locale.c​:514​: emulate_setlocale input=6 (LC_MESSAGES), "(null)", 0, 0
locale.c​:526​: finding index of category 6 (LC_MESSAGES)
locale.c​:550​: index is 4 for LC_MESSAGES
locale.c​:562​: category name is LC_MESSAGES; mask is 0x40
locale.c​:574​: emulate_setlocale querying 1
locale.c​:640​: emulate_setlocale returning C
locale.c​:3496​: setlocale(LC_MESSAGES, NULL) returned "C"
locale.c​:514​: emulate_setlocale input=3 (LC_MONETARY), "(null)", 0, 0
locale.c​:526​: finding index of category 3 (LC_MONETARY)
locale.c​:550​: index is 5 for LC_MONETARY
locale.c​:562​: category name is LC_MONETARY; mask is 0x8
locale.c​:574​: emulate_setlocale querying 1
locale.c​:640​: emulate_setlocale returning C
locale.c​:3496​: setlocale(LC_MONETARY, NULL) returned "C"
locale.c​:514​: emulate_setlocale input=2 (LC_CTYPE), "(null)", 0, 0
locale.c​:526​: finding index of category 2 (LC_CTYPE)
locale.c​:550​: index is 1 for LC_CTYPE
locale.c​:562​: category name is LC_CTYPE; mask is 0x4
locale.c​:574​: emulate_setlocale querying 1
locale.c​:640​: emulate_setlocale returning fi_FI.UTF-8
PL_locale_utf8ness is now
  C0
  POSIX0
  fi_FI.UTF-80; returning 0
Called new_numeric with C, PL_numeric_name=C
Locale radix is '.', ?UTF-8=0
./vutil.c​: 696​: lock lc_numeric_standard​: new depth=2
./vutil.c​: 712​: lc_numeric_standard decrement lock, new depth=1
./vutil.c​: 696​: lock lc_numeric_standard​: new depth=2
./vutil.c​: 712​: lc_numeric_standard decrement lock, new depth=1
./vutil.c​: 696​: lock lc_numeric_standard​: new depth=2
./vutil.c​: 712​: lc_numeric_standard decrement lock, new depth=1
./vutil.c​: 696​: lock lc_numeric_standard​: new depth=2
./vutil.c​: 712​: lc_numeric_standard decrement lock, new depth=1
./vutil.c​: 696​: lock lc_numeric_standard​: new depth=2
./vutil.c​: 712​: lc_numeric_standard decrement lock, new depth=1
./vutil.c​: 696​: lock lc_numeric_standard​: new depth=2
./vutil.c​: 712​: lc_numeric_standard decrement lock, new depth=1
./vutil.c​: 696​: lock lc_numeric_standard​: new depth=2
./vutil.c​: 712​: lc_numeric_standard decrement lock, new depth=1
./vutil.c​: 696​: lock lc_numeric_standard​: new depth=2
./vutil.c​: 712​: lc_numeric_standard decrement lock, new depth=1
./vutil.c​: 696​: lock lc_numeric_standard​: new depth=2
./vutil.c​: 712​: lc_numeric_standard decrement lock, new depth=1
./vutil.c​: 696​: lock lc_numeric_standard​: new depth=2
./vutil.c​: 712​: lc_numeric_standard decrement lock, new depth=1

EXECUTING...

1

@p5pRT
Copy link
Author

p5pRT commented Mar 24, 2019

From @khwilliamson

On 3/24/19 12​:54 PM, Niko Tyni wrote​:

On Sun, Mar 17, 2019 at 08​:09​:55PM -0700, karl williamson via RT wrote​:

On 3/17/19 1​:55 PM, Niko Tyni wrote​:

Thanks. This looks much better. The only relevant failure left
AFAICS is on OpenBSD.

With -Dusethreads, with my patch​:

% LC_CTYPE=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print mblen("\303\244", 5)'
1
% LANG=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print mblen("\303\244", 5)'
2

Why isn't LC_CTYPE working here?

Use the -DLv flag on a DEBUGGING build to get locale handling trace.
Also set PERL_DEBUG_LOCALE_INIT=1 to be effective during the execution.

Thanks. After glaring at this for a bit, I suspect the difference is
that on OpenBSD, newlocale(3) ignores the "oldloc" argument.

From https://man.openbsd.org/uselocale.3 :

On OpenBSD\, mask is only meaningful if it includes LC\_CTYPE\_MASK\,
and locname is only meaningful if it is “C” or “POSIX”\, if
it ends with “\.UTF\-8”\, or if it is an empty string; otherwise\,
newlocale\(\) always returns the C locale\.

On OpenBSD\, newlocale\(\) ignores oldloc\, and passing \(locale\_t\)0
is recommended\.

I'm attaching a log with PERL_DEBUG_LOCALE_INIT=1; unfortunately most
of the DEBUG_L() calls in locale.c are ineffective at locale init time
because the command line options haven't been parsed yet.

I made the ones I thought important displayable at init time.

I've wondered about that third argument being ignored on openbsd. My
guess after looking at this trace is that it isn't the cause, but could
cause memory leaks, but maybe not since they only have those few fixed
locales. But it's a simple matter of changing locale.c to have a
my_newlocale() which is called instead of newlocale(), and is precisely
newlocale() on non-openbsd, but on openbsd instead does a freelocale()
of the third argument before calling newlocale() with a 0 third argument.

The immediate reason for treating the locale as non-UTF8 is that
Perl__is_cur_LC_category_utf8() rules out UTF8-ness because MB_CUR_MAX
is 1.

But why is it 1? That sounds like a platform bug. But I think other
tests would be failing if this were the case. You could run

cd t
PERL_DEBUG_FULL_TEST=2 ./perl -T -I../lib ../t/lib/locale.t

to see what is happening

I think (but am not quite sure) that the following standalone

program depicts the same behaviour. In any case, it gives a different
result for me on Linux (6) than on OpenBSD (1).

#include <stdlib.h>
#include <stdio.h>
#include <locale.h>

int main(void) {
locale_t l, l2;
l = newlocale(LC_CTYPE_MASK, "fi_FI.UTF-8", (locale_t) 0);
uselocale(l);
l2 = newlocale(LC_COLLATE_MASK, "C", l);
uselocale(l2);
printf("%d\n", MB_CUR_MAX);
}

Hope this makes sense. Any further insight would be welcome :)

@p5pRT
Copy link
Author

p5pRT commented Mar 24, 2019

From @khwilliamson

Niko,

I now have access to an openbsd box, so am working on this. I'll let
you know what I find, but I did try your hypotheses about the third
argument to newlocale(), and it didn't change things.

On 3/24/19 3​:58 PM, Karl Williamson wrote​:

On 3/24/19 3​:22 PM, Karl Williamson wrote​:

On 3/24/19 12​:54 PM, Niko Tyni wrote​:

On Sun, Mar 17, 2019 at 08​:09​:55PM -0700, karl williamson via RT wrote​:

On 3/17/19 1​:55 PM, Niko Tyni wrote​:

Thanks. This looks much better. The only relevant failure left
AFAICS is on OpenBSD.

With -Dusethreads, with my patch​:

% LC_CTYPE=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print
mblen("\303\244", 5)'
1
% LANG=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print
mblen("\303\244", 5)'
2

Why isn't LC_CTYPE working here?

Use the -DLv flag on a DEBUGGING build to get locale handling trace.
Also set PERL_DEBUG_LOCALE_INIT=1 to be effective during the execution.
Thanks. After glaring at this for a bit, I suspect the difference is
that on OpenBSD, newlocale(3) ignores the "oldloc" argument.

From https://man.openbsd.org/uselocale.3 :

    On OpenBSD, mask is only meaningful if it includes LC_CTYPE_MASK,
    and locname is only meaningful if it is “C” or “POSIX”, if
    it ends with “.UTF-8”, or if it is an empty string; otherwise,
    newlocale() always returns the C locale.

    On OpenBSD, newlocale() ignores oldloc, and passing (locale_t)0
    is recommended.

I'm attaching a log with PERL_DEBUG_LOCALE_INIT=1; unfortunately most
of the DEBUG_L() calls in locale.c are ineffective at locale init time
because the command line options haven't been parsed yet.

I made the ones I thought important displayable at init time.

I've wondered about that third argument being ignored on openbsd.  My
guess after looking at this trace is that it isn't the cause, but
could cause memory leaks, but maybe not since they only have those few
fixed locales.  But it's a simple matter of changing locale.c to have
a my_newlocale() which is called instead of newlocale(), and is
precisely newlocale() on non-openbsd, but on openbsd instead does a
freelocale() of the third argument before calling newlocale() with a 0
third argument.

Actually, upon further reading of the man pages there, I don't think
this would do any good.  freelocale() is a no-op there.

The immediate reason for treating the locale as non-UTF8 is that
Perl__is_cur_LC_category_utf8() rules out UTF8-ness because MB_CUR_MAX
is 1.

But why is it 1?  That sounds like a platform bug.  But I think other
tests would be failing if this were the case.  You could run

cd t
PERL_DEBUG_FULL_TEST=2 ./perl -T -I../lib ../t/lib/locale.t

to see what is happening

I think (but am not quite sure) that the following standalone

program depicts the same behaviour. In any case, it gives a different
result for me on Linux (6) than on OpenBSD (1).

#include <stdlib.h>
#include <stdio.h>
#include <locale.h>

int main(void) {
     locale_t l, l2;
     l = newlocale(LC_CTYPE_MASK, "fi_FI.UTF-8", (locale_t) 0);
     uselocale(l);
     l2 = newlocale(LC_COLLATE_MASK, "C", l);
     uselocale(l2);
     printf("%d\n", MB_CUR_MAX);
}

Hope this makes sense. Any further insight would be welcome :)

@p5pRT
Copy link
Author

p5pRT commented Mar 25, 2019

From @khwilliamson

On 3/24/19 3​:22 PM, Karl Williamson wrote​:

On 3/24/19 12​:54 PM, Niko Tyni wrote​:

On Sun, Mar 17, 2019 at 08​:09​:55PM -0700, karl williamson via RT wrote​:

On 3/17/19 1​:55 PM, Niko Tyni wrote​:

Thanks. This looks much better. The only relevant failure left
AFAICS is on OpenBSD.

With -Dusethreads, with my patch​:

% LC_CTYPE=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print
mblen("\303\244", 5)'
1
% LANG=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print
mblen("\303\244", 5)'
2

Why isn't LC_CTYPE working here?

Use the -DLv flag on a DEBUGGING build to get locale handling trace.
Also set PERL_DEBUG_LOCALE_INIT=1 to be effective during the execution.
Thanks. After glaring at this for a bit, I suspect the difference is
that on OpenBSD, newlocale(3) ignores the "oldloc" argument.

From https://man.openbsd.org/uselocale.3 :

    On OpenBSD, mask is only meaningful if it includes LC_CTYPE_MASK,
    and locname is only meaningful if it is “C” or “POSIX”, if
    it ends with “.UTF-8”, or if it is an empty string; otherwise,
    newlocale() always returns the C locale.

    On OpenBSD, newlocale() ignores oldloc, and passing (locale_t)0
    is recommended.

I'm attaching a log with PERL_DEBUG_LOCALE_INIT=1; unfortunately most
of the DEBUG_L() calls in locale.c are ineffective at locale init time
because the command line options haven't been parsed yet.

I made the ones I thought important displayable at init time.

I've wondered about that third argument being ignored on openbsd.  My
guess after looking at this trace is that it isn't the cause, but could
cause memory leaks, but maybe not since they only have those few fixed
locales.  But it's a simple matter of changing locale.c to have a
my_newlocale() which is called instead of newlocale(), and is precisely
newlocale() on non-openbsd, but on openbsd instead does a freelocale()
of the third argument before calling newlocale() with a 0 third argument.

Actually, upon further reading of the man pages there, I don't think
this would do any good. freelocale() is a no-op there.

The immediate reason for treating the locale as non-UTF8 is that
Perl__is_cur_LC_category_utf8() rules out UTF8-ness because MB_CUR_MAX
is 1.

But why is it 1?  That sounds like a platform bug.  But I think other
tests would be failing if this were the case.  You could run

cd t
PERL_DEBUG_FULL_TEST=2 ./perl -T -I../lib ../t/lib/locale.t

to see what is happening

I think (but am not quite sure) that the following standalone

program depicts the same behaviour. In any case, it gives a different
result for me on Linux (6) than on OpenBSD (1).

#include <stdlib.h>
#include <stdio.h>
#include <locale.h>

int main(void) {
     locale_t l, l2;
     l = newlocale(LC_CTYPE_MASK, "fi_FI.UTF-8", (locale_t) 0);
     uselocale(l);
     l2 = newlocale(LC_COLLATE_MASK, "C", l);
     uselocale(l2);
     printf("%d\n", MB_CUR_MAX);
}

Hope this makes sense. Any further insight would be welcome :)

@p5pRT
Copy link
Author

p5pRT commented Mar 25, 2019

From @khwilliamson

On 3/24/19 4​:34 PM, Karl Williamson wrote​:

Niko,

I now have access to an openbsd box, so am working on this.  I'll let
you know what I find, but I did try your hypotheses about the third
argument to newlocale(), and it didn't change things.

I pushed your patch, as this looks like its a bug in the OS or perhaps
perl's initialization. I'm working on a C program to reproduce it
without involving perl.

On 3/24/19 3​:58 PM, Karl Williamson wrote​:

On 3/24/19 3​:22 PM, Karl Williamson wrote​:

On 3/24/19 12​:54 PM, Niko Tyni wrote​:

On Sun, Mar 17, 2019 at 08​:09​:55PM -0700, karl williamson via RT wrote​:

On 3/17/19 1​:55 PM, Niko Tyni wrote​:

Thanks. This looks much better. The only relevant failure left
AFAICS is on OpenBSD.

With -Dusethreads, with my patch​:

% LC_CTYPE=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print
mblen("\303\244", 5)'
1
% LANG=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print
mblen("\303\244", 5)'
2

Why isn't LC_CTYPE working here?

Use the -DLv flag on a DEBUGGING build to get locale handling trace.
Also set PERL_DEBUG_LOCALE_INIT=1 to be effective during the
execution.
Thanks. After glaring at this for a bit, I suspect the difference is
that on OpenBSD, newlocale(3) ignores the "oldloc" argument.

From https://man.openbsd.org/uselocale.3 :

    On OpenBSD, mask is only meaningful if it includes LC_CTYPE_MASK,
    and locname is only meaningful if it is “C” or “POSIX”, if
    it ends with “.UTF-8”, or if it is an empty string; otherwise,
    newlocale() always returns the C locale.

    On OpenBSD, newlocale() ignores oldloc, and passing (locale_t)0
    is recommended.

I'm attaching a log with PERL_DEBUG_LOCALE_INIT=1; unfortunately most
of the DEBUG_L() calls in locale.c are ineffective at locale init time
because the command line options haven't been parsed yet.

I made the ones I thought important displayable at init time.

I've wondered about that third argument being ignored on openbsd.  My
guess after looking at this trace is that it isn't the cause, but
could cause memory leaks, but maybe not since they only have those
few fixed locales.  But it's a simple matter of changing locale.c to
have a my_newlocale() which is called instead of newlocale(), and is
precisely newlocale() on non-openbsd, but on openbsd instead does a
freelocale() of the third argument before calling newlocale() with a
0 third argument.

Actually, upon further reading of the man pages there, I don't think
this would do any good.  freelocale() is a no-op there.

The immediate reason for treating the locale as non-UTF8 is that
Perl__is_cur_LC_category_utf8() rules out UTF8-ness because MB_CUR_MAX
is 1.

But why is it 1?  That sounds like a platform bug.  But I think other
tests would be failing if this were the case.  You could run

cd t
PERL_DEBUG_FULL_TEST=2 ./perl -T -I../lib ../t/lib/locale.t

to see what is happening

I think (but am not quite sure) that the following standalone

program depicts the same behaviour. In any case, it gives a different
result for me on Linux (6) than on OpenBSD (1).

#include <stdlib.h>
#include <stdio.h>
#include <locale.h>

int main(void) {
     locale_t l, l2;
     l = newlocale(LC_CTYPE_MASK, "fi_FI.UTF-8", (locale_t) 0);
     uselocale(l);
     l2 = newlocale(LC_COLLATE_MASK, "C", l);
     uselocale(l2);
     printf("%d\n", MB_CUR_MAX);
}

Hope this makes sense. Any further insight would be welcome :)

@p5pRT
Copy link
Author

p5pRT commented Mar 25, 2019

From @khwilliamson

On 3/24/19 9​:54 PM, Karl Williamson wrote​:

On 3/24/19 4​:34 PM, Karl Williamson wrote​:

Niko,

I now have access to an openbsd box, so am working on this.  I'll let
you know what I find, but I did try your hypotheses about the third
argument to newlocale(), and it didn't change things.

I pushed your patch, as this looks like its a bug in the OS or perhaps
perl's initialization.  I'm working on a C program to reproduce it
without involving perl.

The attached C program reproduces the problem, and I believe shows that
openbsd's implementation has a design flaw. I'm posting it here for
comment before I submit it to them

On 3/24/19 3​:58 PM, Karl Williamson wrote​:

On 3/24/19 3​:22 PM, Karl Williamson wrote​:

On 3/24/19 12​:54 PM, Niko Tyni wrote​:

On Sun, Mar 17, 2019 at 08​:09​:55PM -0700, karl williamson via RT
wrote​:

On 3/17/19 1​:55 PM, Niko Tyni wrote​:

Thanks. This looks much better. The only relevant failure left
AFAICS is on OpenBSD.

With -Dusethreads, with my patch​:

% LC_CTYPE=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print
mblen("\303\244", 5)'
1
% LANG=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print
mblen("\303\244", 5)'
2

Why isn't LC_CTYPE working here?

Use the -DLv flag on a DEBUGGING build to get locale handling trace.
Also set PERL_DEBUG_LOCALE_INIT=1 to be effective during the
execution.
Thanks. After glaring at this for a bit, I suspect the difference is
that on OpenBSD, newlocale(3) ignores the "oldloc" argument.

From https://man.openbsd.org/uselocale.3 :

    On OpenBSD, mask is only meaningful if it includes LC_CTYPE_MASK,
    and locname is only meaningful if it is “C” or “POSIX”, if
    it ends with “.UTF-8”, or if it is an empty string; otherwise,
    newlocale() always returns the C locale.

    On OpenBSD, newlocale() ignores oldloc, and passing (locale_t)0
    is recommended.

I'm attaching a log with PERL_DEBUG_LOCALE_INIT=1; unfortunately most
of the DEBUG_L() calls in locale.c are ineffective at locale init time
because the command line options haven't been parsed yet.

I made the ones I thought important displayable at init time.

I've wondered about that third argument being ignored on openbsd.
My guess after looking at this trace is that it isn't the cause, but
could cause memory leaks, but maybe not since they only have those
few fixed locales.  But it's a simple matter of changing locale.c to
have a my_newlocale() which is called instead of newlocale(), and is
precisely newlocale() on non-openbsd, but on openbsd instead does a
freelocale() of the third argument before calling newlocale() with a
0 third argument.

Actually, upon further reading of the man pages there, I don't think
this would do any good.  freelocale() is a no-op there.

The immediate reason for treating the locale as non-UTF8 is that
Perl__is_cur_LC_category_utf8() rules out UTF8-ness because MB_CUR_MAX
is 1.

But why is it 1?  That sounds like a platform bug.  But I think
other tests would be failing if this were the case.  You could run

cd t
PERL_DEBUG_FULL_TEST=2 ./perl -T -I../lib ../t/lib/locale.t

to see what is happening

I think (but am not quite sure) that the following standalone

program depicts the same behaviour. In any case, it gives a different
result for me on Linux (6) than on OpenBSD (1).

#include <stdlib.h>
#include <stdio.h>
#include <locale.h>

int main(void) {
     locale_t l, l2;
     l = newlocale(LC_CTYPE_MASK, "fi_FI.UTF-8", (locale_t) 0);
     uselocale(l);
     l2 = newlocale(LC_COLLATE_MASK, "C", l);
     uselocale(l2);
     printf("%d\n", MB_CUR_MAX);
}

Hope this makes sense. Any further insight would be welcome :)

@p5pRT
Copy link
Author

p5pRT commented Mar 25, 2019

From @khwilliamson

#include <stdlib.h>
#include <stdio.h>
#include <locale.h>

int
main (int argc, char ** argv)
{
    /* Setting LC_CTYPE to C.UTF-8 properly makes a UTF-8 locale */
    locale_t ctype = newlocale(LC_CTYPE_MASK, "C.UTF-8", (locale_t) 0);
    uselocale(ctype);
    fprintf(stderr, "MB_CUR_MAX after ctype=%zu\n", MB_CUR_MAX);

    /* But then setting LC_NUMERIC destroys the LC_CTYPE setting.  Setting a
     * particular category should not affect the current settings of any other
     * category.  On other systems, the third parameter to newlocale() would
     * have been 'ctype' to indicate that, but on openbsd that argument is
     * ignored.  This seems like a design flaw to me */
    locale_t numeric = newlocale(LC_NUMERIC_MASK, "C.UTF-8", (locale_t) 0);
    uselocale(numeric);
    fprintf(stderr, "MB_CUR_MAX after numeric=%zu\n", MB_CUR_MAX);
}

@p5pRT
Copy link
Author

p5pRT commented Mar 25, 2019

From @khwilliamson

On 3/25/19 1​:19 PM, Karl Williamson wrote​:

On 3/24/19 9​:54 PM, Karl Williamson wrote​:

On 3/24/19 4​:34 PM, Karl Williamson wrote​:

Niko,

I now have access to an openbsd box, so am working on this.  I'll let
you know what I find, but I did try your hypotheses about the third
argument to newlocale(), and it didn't change things.

I pushed your patch, as this looks like its a bug in the OS or perhaps
perl's initialization.  I'm working on a C program to reproduce it
without involving perl.

The attached C program reproduces the problem, and I believe shows that
openbsd's implementation has a design flaw.  I'm posting it here for
comment before I submit it to them

I need to apologize to Niko. He had figured things out, and indeed his
program is virtually the same as mine. When I first read it, I didn't
really understand the issue, and when I did understand it, I forgot to
read the program again. If I had, it would have saved me some effort.

In thinking about it since, I'm unsure fhat openbsd needs to look at the
third argument. In their system there are only two locales (and you can
see that in the trace as they are numbered 1 and 2). One locale has all
categories be C. The other locale has LC_CTYPE be C.UTF-8. To
implement this properly, all one has to do is see if the mask parameter
includes LC_CTYPE. If not, do nothing. If so, set the locale to 1 or 2
depending on if the new name ends with UTF-8. I believe this has the
same effect as if one were to take into account the third parameter.

On 3/24/19 3​:58 PM, Karl Williamson wrote​:

On 3/24/19 3​:22 PM, Karl Williamson wrote​:

On 3/24/19 12​:54 PM, Niko Tyni wrote​:

On Sun, Mar 17, 2019 at 08​:09​:55PM -0700, karl williamson via RT
wrote​:

On 3/17/19 1​:55 PM, Niko Tyni wrote​:

Thanks. This looks much better. The only relevant failure left
AFAICS is on OpenBSD.

With -Dusethreads, with my patch​:

% LC_CTYPE=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print
mblen("\303\244", 5)'
1
% LANG=fi_FI.UTF-8 ./perl -Ilib -MPOSIX=mblen -le 'print
mblen("\303\244", 5)'
2

Why isn't LC_CTYPE working here?

Use the -DLv flag on a DEBUGGING build to get locale handling trace.
Also set PERL_DEBUG_LOCALE_INIT=1 to be effective during the
execution.
Thanks. After glaring at this for a bit, I suspect the difference is
that on OpenBSD, newlocale(3) ignores the "oldloc" argument.

From https://man.openbsd.org/uselocale.3 :

    On OpenBSD, mask is only meaningful if it includes LC_CTYPE_MASK,
    and locname is only meaningful if it is “C” or “POSIX”, if
    it ends with “.UTF-8”, or if it is an empty string; otherwise,
    newlocale() always returns the C locale.

    On OpenBSD, newlocale() ignores oldloc, and passing (locale_t)0
    is recommended.

I'm attaching a log with PERL_DEBUG_LOCALE_INIT=1; unfortunately most
of the DEBUG_L() calls in locale.c are ineffective at locale init
time
because the command line options haven't been parsed yet.

I made the ones I thought important displayable at init time.

I've wondered about that third argument being ignored on openbsd.
My guess after looking at this trace is that it isn't the cause,
but could cause memory leaks, but maybe not since they only have
those few fixed locales.  But it's a simple matter of changing
locale.c to have a my_newlocale() which is called instead of
newlocale(), and is precisely newlocale() on non-openbsd, but on
openbsd instead does a freelocale() of the third argument before
calling newlocale() with a 0 third argument.

Actually, upon further reading of the man pages there, I don't think
this would do any good.  freelocale() is a no-op there.

The immediate reason for treating the locale as non-UTF8 is that
Perl__is_cur_LC_category_utf8() rules out UTF8-ness because
MB_CUR_MAX
is 1.

But why is it 1?  That sounds like a platform bug.  But I think
other tests would be failing if this were the case.  You could run

cd t
PERL_DEBUG_FULL_TEST=2 ./perl -T -I../lib ../t/lib/locale.t

to see what is happening

I think (but am not quite sure) that the following standalone

program depicts the same behaviour. In any case, it gives a different
result for me on Linux (6) than on OpenBSD (1).

#include <stdlib.h>
#include <stdio.h>
#include <locale.h>

int main(void) {
     locale_t l, l2;
     l = newlocale(LC_CTYPE_MASK, "fi_FI.UTF-8", (locale_t) 0);
     uselocale(l);
     l2 = newlocale(LC_COLLATE_MASK, "C", l);
     uselocale(l2);
     printf("%d\n", MB_CUR_MAX);
}

Hope this makes sense. Any further insight would be welcome :)

@p5pRT
Copy link
Author

p5pRT commented Mar 26, 2019

From @ntyni

On Mon, Mar 25, 2019 at 02​:14​:35PM -0700, karl williamson via RT wrote​:

I need to apologize to Niko. He had figured things out, and indeed his
program is virtually the same as mine. When I first read it, I didn't
really understand the issue, and when I did understand it, I forgot to
read the program again. If I had, it would have saved me some effort.

Oh, no need to apologize! I could certainly have explained things clearer,
and must admit I was still a bit hazy on details at that point.

Too bad for the duplicate work, but I'm glad we arrived at the same
conclusion :)

In thinking about it since, I'm unsure fhat openbsd needs to look at the
third argument. In their system there are only two locales (and you can
see that in the trace as they are numbered 1 and 2). One locale has all
categories be C. The other locale has LC_CTYPE be C.UTF-8. To
implement this properly, all one has to do is see if the mask parameter
includes LC_CTYPE. If not, do nothing. If so, set the locale to 1 or 2
depending on if the new name ends with UTF-8. I believe this has the
same effect as if one were to take into account the third parameter.

FWIW I also think this would work.

I guess the mb.t tests added in this ticket could use a skip/todo for
openbsd for as long as it has the current behaviour? See attached.
--
Niko

@p5pRT
Copy link
Author

p5pRT commented Mar 26, 2019

From @ntyni

0001-Skip-most-ext-POSIX-t-mb.t-tests-on-OpenBSD-with-thr.patch
From 70d50b021b6437654582fa2c77686ccd1426aa6f Mon Sep 17 00:00:00 2001
From: Niko Tyni <ntyni@debian.org>
Date: Tue, 26 Mar 2019 21:13:20 +0200
Subject: [PATCH] Skip most ext/POSIX/t/mb.t tests on OpenBSD with threads

As discussed in [perl #133928] the locale initialization
code does not currently work on OpenBSD. Skip the failing
tests for now.
---
 ext/POSIX/t/mb.t | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/ext/POSIX/t/mb.t b/ext/POSIX/t/mb.t
index 961edf6cf2..321a6abbe2 100644
--- a/ext/POSIX/t/mb.t
+++ b/ext/POSIX/t/mb.t
@@ -30,6 +30,9 @@ SKIP: {
     skip("LC_CTYPE locale support not available", 2)
       unless locales_enabled('LC_CTYPE');
 
+    skip("locale initialization problems on OpenBSD / threaded [perl #133928]", 2)
+      if $^O eq 'openbsd' and $Config{useithreads};
+
     my $utf8_locale = find_utf8_ctype_locale();
     skip("no utf8 locale available", 2) unless $utf8_locale;
 
-- 
2.20.1

@p5pRT
Copy link
Author

p5pRT commented Mar 26, 2019

From @khwilliamson

On 3/26/19 1​:26 PM, Niko Tyni wrote​:

On Mon, Mar 25, 2019 at 02​:14​:35PM -0700, karl williamson via RT wrote​:

I need to apologize to Niko. He had figured things out, and indeed his
program is virtually the same as mine. When I first read it, I didn't
really understand the issue, and when I did understand it, I forgot to
read the program again. If I had, it would have saved me some effort.

Oh, no need to apologize! I could certainly have explained things clearer,
and must admit I was still a bit hazy on details at that point.

Too bad for the duplicate work, but I'm glad we arrived at the same
conclusion :)

In thinking about it since, I'm unsure fhat openbsd needs to look at the
third argument. In their system there are only two locales (and you can
see that in the trace as they are numbered 1 and 2). One locale has all
categories be C. The other locale has LC_CTYPE be C.UTF-8. To
implement this properly, all one has to do is see if the mask parameter
includes LC_CTYPE. If not, do nothing. If so, set the locale to 1 or 2
depending on if the new name ends with UTF-8. I believe this has the
same effect as if one were to take into account the third parameter.

FWIW I also think this would work.

I guess the mb.t tests added in this ticket could use a skip/todo for
openbsd for as long as it has the current behaviour? See attached.

I'm thinking that this bug in OpenBSD necessitates us not using the
POSIX 20008 calls, so I'll either change the hints file, or if this is a
problem on other systems too, write a Configure probe.

In either event, these tests will start working without having to change
them.

@p5pRT
Copy link
Author

p5pRT commented Apr 2, 2019

From @jkeenan

On Tue, 26 Mar 2019 19​:56​:17 GMT, public@​khwilliamson.com wrote​:

On 3/26/19 1​:26 PM, Niko Tyni wrote​:

On Mon, Mar 25, 2019 at 02​:14​:35PM -0700, karl williamson via RT wrote​:

I need to apologize to Niko. He had figured things out, and indeed his
program is virtually the same as mine. When I first read it, I didn't
really understand the issue, and when I did understand it, I forgot to
read the program again. If I had, it would have saved me some effort.

Oh, no need to apologize! I could certainly have explained things clearer,
and must admit I was still a bit hazy on details at that point.

Too bad for the duplicate work, but I'm glad we arrived at the same
conclusion :)

In thinking about it since, I'm unsure fhat openbsd needs to look at the
third argument. In their system there are only two locales (and you can
see that in the trace as they are numbered 1 and 2). One locale has all
categories be C. The other locale has LC_CTYPE be C.UTF-8. To
implement this properly, all one has to do is see if the mask parameter
includes LC_CTYPE. If not, do nothing. If so, set the locale to 1 or 2
depending on if the new name ends with UTF-8. I believe this has the
same effect as if one were to take into account the third parameter.

FWIW I also think this would work.

I guess the mb.t tests added in this ticket could use a skip/todo for
openbsd for as long as it has the current behaviour? See attached.

I'm thinking that this bug in OpenBSD necessitates us not using the
POSIX 20008 calls, so I'll either change the hints file, or if this is a
problem on other systems too, write a Configure probe.

In either event, these tests will start working without having to change
them.

Among our smoke-testing rigs ext/POSIX/t/mb.t is still experiencing failures on 2 OSes. See​:

http​://perl5.test-smoke.org/submatrix?test=../ext/POSIX/t/mb.t&pversion=5.29.10

Examples​:

OpenBSD-6.4​: http​://perl5.test-smoke.org/report/83977
HP-UX B.11.23/64​: http​://perl5.test-smoke.org/report/83901
--
James E Keenan (jkeenan@​cpan.org)

@p5pRT
Copy link
Author

p5pRT commented Apr 3, 2019

From @khwilliamson

On 4/2/19 3​:38 PM, James E Keenan via RT wrote​:

On Tue, 26 Mar 2019 19​:56​:17 GMT, public@​khwilliamson.com wrote​:

On 3/26/19 1​:26 PM, Niko Tyni wrote​:

On Mon, Mar 25, 2019 at 02​:14​:35PM -0700, karl williamson via RT wrote​:

I need to apologize to Niko. He had figured things out, and indeed his
program is virtually the same as mine. When I first read it, I didn't
really understand the issue, and when I did understand it, I forgot to
read the program again. If I had, it would have saved me some effort.

Oh, no need to apologize! I could certainly have explained things clearer,
and must admit I was still a bit hazy on details at that point.

Too bad for the duplicate work, but I'm glad we arrived at the same
conclusion :)

In thinking about it since, I'm unsure fhat openbsd needs to look at the
third argument. In their system there are only two locales (and you can
see that in the trace as they are numbered 1 and 2). One locale has all
categories be C. The other locale has LC_CTYPE be C.UTF-8. To
implement this properly, all one has to do is see if the mask parameter
includes LC_CTYPE. If not, do nothing. If so, set the locale to 1 or 2
depending on if the new name ends with UTF-8. I believe this has the
same effect as if one were to take into account the third parameter.

FWIW I also think this would work.

I guess the mb.t tests added in this ticket could use a skip/todo for
openbsd for as long as it has the current behaviour? See attached.

I'm thinking that this bug in OpenBSD necessitates us not using the
POSIX 20008 calls, so I'll either change the hints file, or if this is a
problem on other systems too, write a Configure probe.

In either event, these tests will start working without having to change
them.

Among our smoke-testing rigs ext/POSIX/t/mb.t is still experiencing failures on 2 OSes. See​:

http​://perl5.test-smoke.org/submatrix?test=../ext/POSIX/t/mb.t&pversion=5.29.10

Examples​:

OpenBSD-6.4​: http​://perl5.test-smoke.org/report/83977
HP-UX B.11.23/64​: http​://perl5.test-smoke.org/report/83901

OpenBSD is changing their code because of this. I'm leaning to a hints
file change

I have access to a HP_UX 11.23, and its what I consider their bug.
C.UTF-8 is not a fullfledged UTF-8 locale, unlike others on the system.
I'm leaning to changing the test to avoid using that particular locale.

@p5pRT
Copy link
Author

p5pRT commented Apr 3, 2019

From @ntyni

On Tue, Apr 02, 2019 at 07​:39​:35PM -0700, karl williamson via RT wrote​:

On 4/2/19 3​:38 PM, James E Keenan via RT wrote​:

Among our smoke-testing rigs ext/POSIX/t/mb.t is still experiencing failures on 2 OSes. See​:

http​://perl5.test-smoke.org/submatrix?test=../ext/POSIX/t/mb.t&pversion=5.29.10

Examples​:

OpenBSD-6.4​: http​://perl5.test-smoke.org/report/83977
HP-UX B.11.23/64​: http​://perl5.test-smoke.org/report/83901

OpenBSD is changing their code because of this. I'm leaning to a hints
file change

I have access to a HP_UX 11.23, and its what I consider their bug.
C.UTF-8 is not a fullfledged UTF-8 locale, unlike others on the system.
I'm leaning to changing the test to avoid using that particular locale.

Hi, please note that C.UTF-8 is the only UTF-8 locale enabled by default
on at least Debian based platforms. So for wider test coverage you might
want to avoid it only on HP-UX, or something like that...
--
Niko

@p5pRT
Copy link
Author

p5pRT commented Apr 3, 2019

From @khwilliamson

On 4/3/19 12​:35 PM, Niko Tyni wrote​:

On Tue, Apr 02, 2019 at 07​:39​:35PM -0700, karl williamson via RT wrote​:

On 4/2/19 3​:38 PM, James E Keenan via RT wrote​:

Among our smoke-testing rigs ext/POSIX/t/mb.t is still experiencing failures on 2 OSes. See​:

http​://perl5.test-smoke.org/submatrix?test=../ext/POSIX/t/mb.t&pversion=5.29.10

Examples​:

OpenBSD-6.4​: http​://perl5.test-smoke.org/report/83977
HP-UX B.11.23/64​: http​://perl5.test-smoke.org/report/83901

OpenBSD is changing their code because of this. I'm leaning to a hints
file change

I have access to a HP_UX 11.23, and its what I consider their bug.
C.UTF-8 is not a fullfledged UTF-8 locale, unlike others on the system.
I'm leaning to changing the test to avoid using that particular locale.

Hi, please note that C.UTF-8 is the only UTF-8 locale enabled by default
on at least Debian based platforms. So for wider test coverage you might
want to avoid it only on HP-UX, or something like that...

Of course

@p5pRT
Copy link
Author

p5pRT commented Apr 6, 2019

From @khwilliamson

On 4/2/19 8​:39 PM, Karl Williamson wrote​:

On 4/2/19 3​:38 PM, James E Keenan via RT wrote​:

On Tue, 26 Mar 2019 19​:56​:17 GMT, public@​khwilliamson.com wrote​:

On 3/26/19 1​:26 PM, Niko Tyni wrote​:

On Mon, Mar 25, 2019 at 02​:14​:35PM -0700, karl williamson via RT wrote​:

I need to apologize to Niko.  He had figured things out, and indeed
his
program is virtually the same as mine.  When I first read it, I didn't
really understand the issue, and when I did understand it, I forgot to
read the program again.  If I had, it would have saved me some effort.

Oh, no need to apologize! I could certainly have explained things
clearer,
and must admit I was still a bit hazy on details at that point.

Too bad for the duplicate work, but I'm glad we arrived at the same
conclusion :)

In thinking about it since, I'm unsure fhat openbsd needs to look
at the
third argument.  In their system there are only two locales (and
you can
see that in the trace as they are numbered 1 and 2).  One locale
has all
categories be C.  The other locale has LC_CTYPE be C.UTF-8.  To
implement this properly, all one has to do is see if the mask
parameter
includes LC_CTYPE.  If not, do nothing.  If so, set the locale to 1
or 2
depending on if the new name ends with UTF-8.  I believe this has the
same effect as if one were to take into account the third parameter.

FWIW I also think this would work.

I guess the mb.t tests added in this ticket could use a skip/todo for
openbsd for as long as it has the current behaviour? See attached.

I'm thinking that this bug in OpenBSD necessitates us not using the
POSIX 20008 calls, so I'll either change the hints file, or if this is a
problem on other systems too, write a Configure probe.

In either event, these tests will start working without having to change
them.

Among our smoke-testing rigs ext/POSIX/t/mb.t is still experiencing
failures on 2 OSes.  See​:

http​://perl5.test-smoke.org/submatrix?test=../ext/POSIX/t/mb.t&pversion=5.29.10

Examples​:

OpenBSD-6.4​: http​://perl5.test-smoke.org/report/83977
HP-UX B.11.23/64​: http​://perl5.test-smoke.org/report/83901

OpenBSD is changing their code because of this.  I'm leaning to a hints
file change

Now done. OpenBSD should now be passing.

I have access to a HP_UX 11.23, and its what I consider their bug.
C.UTF-8 is not a fullfledged UTF-8 locale, unlike others on the system.
I'm leaning to changing the test to avoid using that particular locale.

I'm waiting for Tux to get 11.31 compiling again to see if HP eventually
fixed this bug. I'll put in a skip of this test when I see what the
conditions for it should be.

@p5pRT
Copy link
Author

p5pRT commented Apr 24, 2019

From @iabyn

On Sat, Apr 06, 2019 at 11​:59​:59AM -0600, Karl Williamson wrote​:

I'm waiting for Tux to get 11.31 compiling again to see if HP eventually
fixed this bug. I'll put in a skip of this test when I see what the
conditions for it should be.

There have since been several commits to ext/POSIX/t/mb.t. Is there any
reason for this ticket to be a 5.30.0 blocker any more?

--
That he said that that that that is is is debatable, is debatable.

@p5pRT p5pRT added the khw label Oct 25, 2019
@toddr toddr removed the khw label Oct 25, 2019
@khwilliamson
Copy link
Contributor

I don't see anything left to do in this ticket. Is it closable?

In the current branch smoke-me/khw-locale, I have added code to allow one to initialize the shift state for mblen by passing an undef as the first parameter. Please try it out.

@ntyni
Copy link
Contributor

ntyni commented Nov 29, 2019

Yes, I think this is closable. Thanks.

I tried smoke-me/khw-locale as far as running the test suite and checking that mblen() didn't start crashing again on threaded perl. All looks fine to me. I'm afraid I can't comment on the actual shift state initization part.

@khwilliamson
Copy link
Contributor

As a final note, HP did fix this by 11.31

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants