Unicode readdir bugs #11513

Open
p5pRT opened this issue Jul 19, 2011 · 25 comments

Comments

p5pRT commented Jul 19, 2011

Migrated from rt.perl.org#95160 (status was 'open')

Searchable as RT95160$

p5pRT commented Jul 19, 2011

From tchrist@perl.com

I'm really rather unhappy with the what you see isn't
what you get approach Perl is taking here.

Consider this​:

  #!/usr/bin/env perl
  use v5.12;
  use utf8;
  use strict;
  use autodie;
  use warnings;
  binmode(STDOUT, ":utf8");
  binmode(STDERR, ":utf8");
  END { close STDOUT }
  my @στιγματα = qw( ΣΤΙΓΜΑΣ στιγμασ στιγμας );
  for my $στιγμα (@στιγματα) {
      my $fh;
      open $fh, "> :utf8", $στιγμα;
      say $fh "στιγμα";
      close $fh;
  }
  opendir(my $dh, ".");
  while (readdir($dh)) {
      say if /\P{ASCII}/;
  }
  closedir($dh);

Run on Linux, I get this nonsense​:

  Ï�Ï�ιγμαÏ�
  Ï�Ï�ιγμαÏ�
  ΣΤÎ�Î�Î�Î�Σ

Run on Darwin, I get this, which is even worse​:

  Ï�Ï�ιγμαÏ�
  ΣΤÎ�Î�Î�Î�Σ

*Who* told Perl it was ok to let me blithely use wide characters in
creat but then forbad me from using them in readdir? That's stupid.
Perl should forbid unencoded wide characters in syscalls. It already
does in syswrite. Why not here?

Yes, if I make my loop

  while (my $enc = readdir($dh)) {
      use Encode qw(decode);
      $_ = decode "UTF-8", $enc;
      say if /\P{ASCII}/;
  }

Then I get

  στιγμας
  στιγμασ
  ΣΤΙΓΜΑΣ

on Linux and

  στιγμας
  ΣΤΙΓΜΑΣ

on Darwin.

But that's nutty, and in several ways.

First off, Darwin's case-insensitive filesystem is an idiot, and doesn't
work correctly. Notice how it is not doing casefolding correctly. It
let me create two files whose names are casefolds of each other, even
though all three names are.
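
To make the casefolding point concrete, here is a minimal sketch, assuming Perl 5.16 or later for the `fc` builtin (newer than the 5.14 in this report). It shows that all three names share one full Unicode casefold, while plain `lc` still keeps the two lowercase spellings apart. Whether HFS+ compares names in an `lc`-like way is only a guess that happens to fit the Darwin output above.

```perl
use v5.16;          # enables fc(); assumption: a newer perl than the one reported here
use utf8;
use strict;
use warnings;

my @names = qw( ΣΤΙΓΜΑΣ στιγμασ στιγμας );

# Full casefolding maps capital sigma, small sigma and final sigma
# to the same letter, so the three names collapse to a single key.
my %by_fold;
push @{ $by_fold{ fc($_) } }, $_ for @names;

# lc() leaves the two already-lowercase spellings untouched,
# so they remain distinct strings.
my $lc_distinct = lc($names[1]) ne lc($names[2]);

binmode(STDOUT, ":encoding(UTF-8)");
say scalar(keys %by_fold), " distinct casefold(s)";
say "lc keeps final sigma distinct: ", ($lc_distinct ? "yes" : "no");
```

A filesystem that wanted all three names to collide would need to compare on something like the `fc` key rather than on a simple lowercase mapping.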

But secondly and of greater importance, I should be able to
do something like​:

  binmode($dh, ":utf8");

or even

  opendir(my $dh, ":utf8", ".");

And not have to deal with this really really stupid encoding business.

Is there a reason that this is not a bug that should be fixed?

And don't even get me started about glob(). It's broken, too.
Have fun with HFS+'s quasi-NFD filesystem, eh?
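
For the HFS+ remark, a short sketch of why decomposed names bite: a name read back in decomposed form is not `eq` to the precomposed string that was typed, until both are normalized. It uses the core Unicode::Normalize module; the file name is made up for illustration, and no claim is made here about exactly which decomposition HFS+ applies.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical name, as a program would write it (precomposed e-acute, U+00E9).
my $typed = "caf\x{E9}";

# What a decomposing filesystem might hand back: "e" followed by U+0301.
my $from_disk = NFD($typed);

my $raw_equal  = $typed eq $from_disk;              # different code points
my $norm_equal = NFC($typed) eq NFC($from_disk);    # same text once normalized

printf "raw: %s, normalized: %s\n",
    ($raw_equal  ? "equal" : "different"),
    ($norm_equal ? "equal" : "different");
```

The same normalization step would apply to names coming back from glob() after they have been decoded.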

--tom

Summary of my perl5 (revision 5 version 14 subversion 0) configuration​:
 
  Platform​:
  osname=openbsd, osvers=4.4, archname=OpenBSD.i386-openbsd
  uname='openbsd chthon 4.4 generic#0 i386 '
  config_args='-des'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=undef, usemultiplicity=undef
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=undef, use64bitall=undef, uselongdouble=undef
  usemymalloc=y, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
  optimize='-O2',
  cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='3.3.5 (propolice)', gccosandvers='openbsd4.4'
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags ='-Wl,-E -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /usr/lib
  libs=-lgdbm -lm -lutil -lc
  perllibs=-lm -lutil -lc
  libc=/usr/lib/libc.so.48.0, so=so, useshrplib=false, libperl=libperl.a
  gnulibc_version=''
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
  cccdlflags='-DPIC -fPIC ', lddlflags='-shared -fPIC -L/usr/local/lib -fstack-protector'

Characteristics of this binary (from libperl)​:
  Compile-time options​: MYMALLOC PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP
  PERL_PRESERVE_IVUV USE_LARGE_FILES USE_PERLIO
  USE_PERL_ATOF
  Built under openbsd
  Compiled at Jun 11 2011 11​:48​:28
  %ENV​:
  PERL_UNICODE="SA"
  @​INC​:
  /usr/local/lib/perl5/site_perl/5.14.0/OpenBSD.i386-openbsd
  /usr/local/lib/perl5/site_perl/5.14.0
  /usr/local/lib/perl5/5.14.0/OpenBSD.i386-openbsd
  /usr/local/lib/perl5/5.14.0
  /usr/local/lib/perl5/site_perl/5.12.3
  /usr/local/lib/perl5/site_perl/5.11.3
  /usr/local/lib/perl5/site_perl/5.10.1
  /usr/local/lib/perl5/site_perl/5.10.0
  /usr/local/lib/perl5/site_perl/5.8.7
  /usr/local/lib/perl5/site_perl/5.8.0
  /usr/local/lib/perl5/site_perl/5.6.0
  /usr/local/lib/perl5/site_perl/5.005
  /usr/local/lib/perl5/site_perl
  .

p5pRT commented Jul 20, 2011

From @ikegami

On Tue, Jul 19, 2011 at 2​:39 PM, tchrist1 <perlbug-followup@​perl.org> wrote​:

# New Ticket Created by tchrist1
# Please include the string​: [perl #95160]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=95160 >

I'm really rather unhappy with the what you see isn't
what you get approach Perl is taking here.

Consider this​:

#!/usr/bin/env perl
use v5.12;
use utf8;
use strict;
use autodie;
use warnings;
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");
END { close STDOUT }
my @στιγματα = qw( ΣΤΙΓΜΑΣ στιγμασ στιγμας );
for my $στιγμα (@στιγματα) {
    my $fh;
    open $fh, "> :utf8", $στιγμα;
    say $fh "στιγμα";
    close $fh;
}
opendir(my $dh, ".");
while (readdir($dh)) {
    say if /\P{ASCII}/;
}
closedir($dh);

Run on Linux, I get this nonsense​:

στιγμας
στιγμασ
ΣΤΙΓΜΑΣ

Just like​:

  - Input from STDIN must be decoded.
  - Output to STDOUT and STDERR must be encoded.

This applies​:

  - Input from @​ARGV and file names from builtins must be decoded.
  - File names passed to builtins must be encoded.

You can get away with not doing the fourth because you have a UTF-8 locale
and C<open> suffers from The Unicode Bug.
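
The four rules can be sketched for the file-name case as follows: encode names on the way into builtins, decode what comes back. UTF-8 is assumed as the file-name encoding here, which is exactly the assumption debated in this ticket; the scratch directory comes from File::Temp so nothing in the current directory is touched.

```perl
use utf8;
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $name = "στιγμα";                       # a character string

# Encode before handing the name to open().
my $path = "$dir/" . encode("UTF-8", $name);
open(my $fh, ">", $path) or die "open: $!";
close($fh) or die "close: $!";

# Decode what readdir() hands back, then drop "." and "..".
opendir(my $dh, $dir) or die "opendir: $!";
my @found = grep { !/\A\.\.?\z/ }
            map  { decode("UTF-8", $_) } readdir($dh);
closedir($dh);
```

With both directions handled explicitly, the name that comes back compares equal to the character string that went in, at least for names the filesystem does not renormalize.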

- Eric

p5pRT commented Jul 20, 2011

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Jul 20, 2011

From tchrist@perl.com

  +---------------------------------------------------------------------+
  | This is automatic mail from Tom Christiansen's answering service. |
  | UPDATED​: Mon Jul 18 12​:44​:53 MDT 2011 |
  +---------------------------------------------------------------------+

  I've got six weeks of work to do in that number of days, so I'll be
  almost entirely out of email-touch during that time. From July 25-29th
  I'll be attending OSCON in Portland, and over weekend of August 12-14th
  I'll be attending my high-school reunion in Lake Geneva--and celebrating
  being done with updating Programming Perl. Cue the fireworks.

  In the meanwhile, I will not in general be reading, let alone answering,
  any incoming email. The five exceptions are as follows, with suggested
  tags to add to the subject line to make sure I notice them​:

#1​: Life-and-death situations -- why are you using email for that?
  (e.g.) Subject​: [1=DYING] blah blah blah

#2​: Personal family matters of my own relations -- again, try the phone.
  (e.g.) Subject​: [2=FAMILY] blah blah blah

#3​: Issues @​work w/my University textmining job *INVOLVING ME PERSONALLY*.
  (e.g.) Subject​: [3=WORK] blah blah blah

#4​: Prepping my 4.5h of Unicode talks for next week's conference in Portland.
  (e.g.) Subject​: [4=OSCON] blah blah blah

#5​: Prepping a kilopage of the Camel Book's 4th ed. for Production by mid-August.
  (e.g.) Subject​: [5=BOOK] blah blah blah

  Because I will be *disconnecting my laptop from the Internet* so I can
  actually get something done, I'll be answering mail twice a day *only*​:

  1) once cheerfully between 5-7am MDT (UTC-0600)
  2) once perhaps rather grumpily between 6-8pm MDT (UTC-0600)

  I'm a morning person, so those are the only two choices you're liable to
  get​: gleeful or glowering, with little middle ground. I expect to answer
  no mail outside those five special categories listed above until well into
  August. It **might** happen, but never count on it.

  Thank you for your forbearance.

  --tom

p5pRT commented Sep 19, 2011

From @cpansprout

On Tue Jul 19 11​:39​:04 2011, tom christiansen wrote​:

I'm really rather unhappy with the what you see isn't
what you get approach Perl is taking here.

Consider this​:

#!/usr/bin/env perl
use v5.12;
use utf8;
use strict;
use autodie;
use warnings;
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");
END { close STDOUT }
my @στιγματα = qw( ΣΤΙΓΜΑΣ στιγμασ στιγμας );
for my $στιγμα (@στιγματα) {
    my $fh;
    open $fh, "> :utf8", $στιγμα;
    say $fh "στιγμα";
    close $fh;
}
opendir(my $dh, ".");
while (readdir($dh)) {
    say if /\P{ASCII}/;
}
closedir($dh);

Run on Linux, I get this nonsense​:

��ιγμα�
��ιγμα�
ΣΤ����Σ

Run on Darwin, I get this, which is even worse​:

��ιγμα�
ΣΤ����Σ

*Who* told Perl it was ok to let me blithely use wide characters in
creat but then forbad me from using them in readdir? That's stupid.
Perl should forbid unencoded wide characters in syscalls. It already
does in syswrite. Why not here?

Almost all (if not all?) Perl functions that take file names have this
problem. They all ignore the UTF8 flag.

I would suggest we use a ‘Wide character’ warning, as we have for print
and warn.

Then we also need a pragma to enable Unicode filenames in -e, open,
readdir, chdir, etc.

What should we call it?

What do we do on systems on which file names *are* just octet sequences
and nothing more? Make loading the pragma die? Make it warn? Do nothing?

Also, what about systems that support Unicode, but for which no one has
had the time to implement this? (I’m not going to do VMS, for instance.)

p5pRT commented Sep 19, 2011

From @ikegami

On Sun, Sep 18, 2011 at 8​:40 PM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

Then we also need a pragma to enable Unicode filenames in -e, open,
readdir, chdir, etc.

What should we call it?

What do we do on systems on which file names *are* just octet sequences
and nothing more? Make loading the pragma die? Make it warn? Do nothing?

File names are meant to be read as text, so one can't really claim they're
just octet sequences. So the real question is what should we do when readdir
encounters a file name that doesn't cleanly decode using the encoding it's
expected to be encoded with (e.g. a file name that's not valid UTF-8 on a
box with a UTF-8 locale).
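
One possible answer to that question, sketched under the assumption that UTF-8 is the expected encoding: decode strictly, and when that fails hand back the untouched bytes together with a flag. `decode_name` is a hypothetical helper, not anything perl provides.

```perl
use strict;
use warnings;
use Encode ();

# Returns (text, 1) when the bytes are valid UTF-8, or (bytes, 0) otherwise.
sub decode_name {
    my ($bytes) = @_;
    my $text = eval {
        # FB_CROAK dies on malformed input; LEAVE_SRC keeps $bytes intact.
        Encode::decode("UTF-8", $bytes,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? ($text, 1) : ($bytes, 0);
}

my ($ok_name,  $ok)  = decode_name("\xCF\x83");   # valid UTF-8 (small sigma)
my ($bad_name, $bad) = decode_name("\xFF\xFE");   # not valid UTF-8
```

The caller then has to carry that flag around, which is the kind of two-kinds-of-string bookkeeping the rest of this discussion tries to avoid.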

Also, what about systems that support Unicode, but for which no one has
had the time to implement this? (I’m not going to do VMS, for instance.)

Like open -| with multiple args on Windows?
Like use open :locale on Windows?
Croak.

p5pRT commented Oct 4, 2011

From @ap

* Eric Brine <ikegami@​adaelis.com> [2011-09-19 03​:20]​:

File names are meant to be read as text, so one can't really claim
they're just octet sequences. So the real question is what should we
do when readdir encounters a file name that doesn't cleanly decode
using the encoding it's expected to be encoded with (e.g. a file name
that's not valid UTF-8 on a box with a UTF-8 locale).

One could take a page from Python here and use its surrogate escape
error handling. There was a subthread about it a while ago​:
http://www.nntp.perl.org/group/perl.perl5.porters/;msgid=A8767ACF-E6A0-498A-B402-54A12D26523B@activestate.com

What this approach effectively does is allow strings to unambiguously
represent a mixture of bytes and characters, which in a roundabout way
essentially solves the problem that Perl only has a single string type.
But do note the later message about the security implications. It will
take some thought to get this clean, but there is a lot of potential in
it.
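
A rough Perl analogue of that error handler can be sketched by hand. This is not an Encode feature; `decode_escaped` and `encode_escaped` are hypothetical names, and UTF-8 is assumed as the base encoding. Each undecodable byte 0xNN is mapped to the lone surrogate U+DCNN and mapped back on the way out, so arbitrary byte strings survive a round trip.

```perl
use strict;
use warnings;
no warnings 'utf8';          # surrogate code points are used on purpose
use Encode ();

# Bytes -> characters; each undecodable byte 0xNN becomes U+DCNN.
sub decode_escaped {
    my ($bytes) = @_;
    my $out = '';
    while (length $bytes) {
        # FB_QUIET returns the valid prefix and leaves the rest in $bytes.
        $out .= Encode::decode("UTF-8", $bytes, Encode::FB_QUIET());
        # Whatever is left starts with a bad byte: escape it and carry on.
        $out .= chr(0xDC00 + ord(substr($bytes, 0, 1, ''))) if length $bytes;
    }
    return $out;
}

# Characters -> bytes; U+DC80..U+DCFF turn back into the original byte.
sub encode_escaped {
    my ($text) = @_;
    return join '', map {
        my $cp = ord;
        ($cp >= 0xDC80 && $cp <= 0xDCFF)
            ? chr($cp - 0xDC00)
            : Encode::encode("UTF-8", $_);
    } split //, $text;
}

my $raw   = "\xCF\x83\xFF.txt";      # small sigma, one stray byte, ".txt"
my $name  = decode_escaped($raw);
my $again = encode_escaped($name);
```

The security caveat mentioned above applies to any such scheme: escaped strings must never be allowed to leak into output that other programs will treat as well-formed text.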

I love the idea and it is one of my todos to add this to Encode should
no one else get there first. The core could then use this method to
provide clean and nice interfaces to any OS APIs which are textual in
intent but binary in practice – as Python does.

It would be a major step forward for Perl.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT commented Oct 23, 2011

From @cpansprout

On Tue Oct 04 09​:02​:13 2011, aristotle wrote​:

* Eric Brine <ikegami@​adaelis.com> [2011-09-19 03​:20]​:

File names are meant to be read as text, so one can't really claim
they're just octet sequences. So the real question is what should we
do when readdir encounters a file name that doesn't cleanly decode
using the encoding it's expected to be encoded with

If that happens, then it’s not really text, is it?

(e.g. a file name
that's not valid UTF-8 on a box with a UTF-8 locale).

No, no, please don’t start using the locale to determine what the file
names are. That would mean that a change to an environment variable
would cause configuration files to start referring to other
‘nonexistent’ files (which exist when the locale is set correctly). We
should *only* support Unicode file names when the file system itself has
encoding information.

Mac OS X, for instance, stores the encoding in the file system (so each
volume could theoretically use a different encoding), but the low-level
drivers that read the volume translate everything to UTF-8. If you try
to create a file whose name is an invalid UTF-8 sequence, you get an
‘Invalid argument’ error.

On the other hand, if we keep things completely consistent on a given
platform (treat Linux as UTF-8, for instance, regardless of any
environment settings), then we could follow Aristotle’s suggestion below
for platforms that do not have an inherent file name encoding system.

Also, nobody has answered my question​: What do we call the pragma?
unicode​::filenames? I suppose we need to make a list first of which
functions will be affected, so here goes​:

dbmopen -X chdir chmod chown chroot fcntl glob link lstat mkdir open
opendir readlink rename rmdir stat symlink sysopen umask unlink utime do
require use

Those are all file name functions.

But what about user and group names?

exec, system, syscall, readpipe, bind, connect, getsockopt, shmwrite and
the various network functions (e.g., getservbyname) should produce ‘Wide
character’ warnings. (Someone who understands non-ASCII domain names
should speak up now.)

One could take a page from Python here and use its surrogate escape
error handling. There was a subthread about it a while ago​:
http://www.nntp.perl.org/group/perl.perl5.porters/;msgid=A8767ACF-E6A0-498A-B402-54A12D26523B@activestate.com

What this approach effectively does is allow strings to unambiguously
represent a mixture of bytes and characters, which in a roundabout way
essentially solves the problem that Perl only has a single string type.
But do note the later message about the security implications. It will
take some thought to get this clean, but there is a lot of potential in
it.

I love the idea and it is one of my todos to add this to Encode should
no one else get there first. The core could then use this method to
provide clean and nice interfaces to any OS APIs which are textual in
intent but binary in practice – as Python does.

It would be a major step forward for Perl.

Regards,

p5pRT commented Oct 24, 2011

From @Hugmeir

On Sun, Oct 23, 2011 at 7​:23 PM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

On Tue Oct 04 09​:02​:13 2011, aristotle wrote​:

* Eric Brine <ikegami@​adaelis.com> [2011-09-19 03​:20]​:

File names are meant to be read as text, so one can't really claim
they're just octet sequences. So the real question is what should we
do when readdir encounters a file name that doesn't cleanly decode
using the encoding it's expected to be encoded with

If that happens, then it’s not really text, is it?

(e.g. a file name
that's not valid UTF-8 on a box with a UTF-8 locale).

No, no, please don’t start using the locale to determine what the file
names are. That would mean that a change to an environment variable
would cause configuration files to start referring to other
‘nonexistent’ files (which exist when the locale is set correctly). We
should *only* support Unicode file names when the file system itself has
encoding information.

Mac OS X, for instance, stores the encoding in the file system (so each
volume could theoretically use a different encoding), but the low-level
drivers that read the volume translate everything to UTF-8. If you try
to create a file whose name is an invalid UTF-8 sequence, you get an
‘Invalid argument’ error.

On the other hand, if we keep things completely consistent on a given
platform (treat Linux as UTF-8, for instance, regardless of any
environment settings), then we could follow Aristotle’s suggestion below
for platforms that do not have an inherent file name encoding system.

Also, nobody has answered my question​: What do we call the pragma?
unicode​::filenames? I suppose we need to make a list first of which
functions will be affected, so here goes​:

dbmopen -X chdir chmod chown chroot fcntl glob link lstat mkdir open
opendir readlink rename rmdir stat symlink sysopen umask unlink utime do
require use

Those are all file name functions.

But what about user and group names?

exec, system, syscall, readpipe, bind, connect, getsockopt, shmwrite and
the various network functions (e.g., getservbyname) should produce ‘Wide
character’ warnings. (Someone who understands non-ASCII domain names
should speak up now.)

(Reading the Python thread is still on my TODO list, so I'm not commenting
on that yet)

There's a couple of things here being grouped as one. Ignoring
require/use/do for a moment, most of those functions already have bug
reports on them because, let me quote tchrist here,

*Who* told Perl it was ok to let me blithely use wide characters in
creat but then forbad me from using them in readdir? That's stupid.
Perl should forbid unencoded wide characters in syscalls. It already
does in syswrite.

So, first thing: Be like syswrite. -All- syscalls, sans for
say/print/printf/warn/die which already have exceptions, should croak if
passed non-downgradeable scalars. This needn't be a backwards-incompatible
nightmare -- Save for exec and system, Classic::Perl could override them to
do something like

  require Encode;
  *CORE::GLOBAL::rename = sub ($$) {   # install in a BEGIN block to take effect
      my @names = @_;
      Encode::_utf8_off($_) for @names;   # keep the internal bytes as-is
      return CORE::rename($names[0], $names[1]);
  };

And there you go. You get Perl's previous ultralax behavior.
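
The "be like syswrite" half of that suggestion can be sketched as a plain wrapper. `strict_rename` is a hypothetical helper, not a proposed core API, and the error text is only modelled on the existing "Wide character" message.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper: refuse names that cannot be expressed as bytes.
sub strict_rename {
    my @names = @_[0, 1];
    for my $name (@names) {
        # With a true second argument, utf8::downgrade returns false
        # instead of dying when the string holds characters above 0xFF.
        utf8::downgrade($name, 1)
            or croak "Wide character in rename";
    }
    return rename $names[0], $names[1];
}

# A name with U+03C3 in it is rejected before rename() is ever called.
my $died = !eval { strict_rename("old-\x{3C3}", "new"); 1 };
```

Callers who do want a Unicode name on disk would then have to encode it explicitly first, which is the behaviour being argued for.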

Second, there should be a way to avoid doing an encode/decode on every
syscall. Since I haven't read the Python thread yet I can't say much on
this, but for a while I've had a open-like pragma for this in mind, eg

  use syscalls IN => ":encoding(...)", OUT => ":encoding(...)";

or

  use syscalls :dir => { IN => ":encoding(...)", OUT => ":encoding(...)" }

Or somesuch, which won't solve problems in, say, Windows, but hopefully it
won't make them any worse. Then you could implement unicode​::filenames as a
wrapper around that, and if you want to grab that layer from a locale
setting, that's entirely up to you (just don't ask me to debug it later).

Third, require/use/do. I recall Python having some problems with this (if
the thread that I've neglected reading touches this, I apologize) -- And
actually, I don't know any language that supports it without issues, though
pointers are of course welcome.
Zefram had a great idea for this a while ago -- If a module has Unicode in
its path, it should get an alias, reachable through some escaping scheme or
another. So if I had a module Eeyup​::\x{30cb}​::Bothersome, Bothersome.pm
would be reachable through Eeyup/\x{30cb}/, and, failing that,
unialias/Eeyup/130cb/

Here's the nicest thing -- I implemented 1 and a prototype of 2 in a couple
of hours, so it's certainly doable, though I haven't touched that in a while
because I can't figure out a way to test 2 portably.

p5pRT commented Oct 24, 2011

From @cpansprout

On Sun Oct 23 18​:26​:45 2011, Hugmeir wrote​:

There's a couple of things here being grouped as one. Ignoring
require/use/do for a moment, most of those functions already have bug
reports on them because, let me quote tchrist here,

*Who* told Perl it was ok to let me blithely use wide characters in
creat but then forbad me from using them in readdir? That's stupid.
Perl should forbid unencoded wide characters in syscalls. It already
does in syswrite.

So, first thing​: Be like syswrite. -All- syscalls, sans for
say/print/printf/warn/die which already have exceptions, should croak if
passed non-downgradeable scalars.

(Please, don’t put -deable at the end of a Latin-based word. :-) It’s
‘downgradable’.)

syswrite seems to be the odd one out. It’s probably using SvPVbyte.
print, die, and warn just warn (i.e., warn chr 256 produces two
warnings). It’s a default warning, though.

With the new pragma, I would suggest fixing the Unicode bug for those
functions when the pragma is off (with a warning and fallback). If that
causes CPAN breakage, then the new behaviour should be enabled with ‘use

Second, there should be a way to avoid doing an encode/decode on every
syscall. Since I haven't read the Python thread yet I can't say much on
this, but for a while I've had a open-like pragma for this in mind, eg

use syscalls IN => "​:encoding(...)", OUT => "​:encoding(...)";

or

use syscalls :dir => { IN => "​:encoding(...)", OUT => "​:encoding(...)" }

Or somesuch, which won't solve problems in, say, Windows, but hopefully it
won't make them any worse.

I think it would make things worse, as we would have yet another
non-portable interface that is unusable as a result. In this case it’s
not even portable between Unix systems, because it cannot be used
correctly on Mac OS X, which forces file names on *all* Unix interfaces
to be in UTF-8.

On the other hand we could provide it with lots of caveats in the
documentation. Maybe it could be part of the same pragma.

Then you could implement unicode​::filenames as a
wrapper around that, and if you want to grab that layer from a locale
setting, that's entirely up to you (just don't ask me to debug it later).

Third, require/use/do. I recall Python having some problems with this (if
the thread that I've neglected reading touches this, I apologize) -- And
actually, I don't know any language that supports it without issues,
though
pointers are of course welcome.
Zefram had a great idea for this a while ago -- If a module has Unicode in
its path, it should get an alias, reachable through some escaping
scheme or
another. So if I had a module Eeyup​::\x{30cb}​::Bothersome, Bothersome.pm
would be reachable through Eeyup/\x{30cb}/, and, failing that,
unialias/Eeyup/130cb/

Here's the nicest thing -- I implemented 1 and a prototype of 2 in a
couple
of hours, so it's certainly doable, though I haven't touched that in a
while
because I can't figure out a way to test 2 portably.

It sounds like a nice idea at first, but I worry about modules
‘disappearing’ depending on what pragma is enabled.

p5pRT commented Oct 24, 2011

From @Hugmeir

On Sun, Oct 23, 2011 at 11​:44 PM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

On Sun Oct 23 18​:26​:45 2011, Hugmeir wrote​:

There's a couple of things here being grouped as one. Ignoring
require/use/do for a moment, most of those functions already have bug
reports on them because, let me quote tchrist here,

*Who* told Perl it was ok to let me blithely use wide characters in
creat but then forbad me from using them in readdir? That's stupid.
Perl should forbid unencoded wide characters in syscalls. It already
does in syswrite.

So, first thing​: Be like syswrite. -All- syscalls, sans for
say/print/printf/warn/die which already have exceptions, should croak if
passed non-downgradeable scalars.

(Please, don’t put -deable at the end of a Latin-based word. :-) It’s
‘downgradable’.)

But I like my half-broken english..! Fine :P

syswrite seems to be the odd one out. It’s probably using SvPVbyte.
print, die, and warn just warn (i.e., warn chr 256 produces two
warnings). It’s a default warning, though.

That's true, but consider which one of those has the actually useful
behavior. How many times have you gotten a "Wide character" warning that
left you with mostly worthless output, and had to rerun things by adding the
layers?

Also, how often do you actually want to pass the internal form of UTF-8 to
system calls? I'm not saying it can't happen, but it's certainly not the
common use case. On nearly every other occasion it's a bug that Perl isn't
reporting, and a warning in this case is twice as useless.

With the new pragma, I would suggest fixing the Unicode bug for those
functions when the pragma is off (with a warning and fallback). If that
causes CPAN breakage, then the new behaviour should be enabled with ‘use

I don't think it would cause any more breakage than when the Fcntl
constants subs became actual ()-prototyped constants. The only things that
"broke" were already broken, but Perl wasn't reporting it.

(I'd have little qualms if this were triggered by a 'use VERSION;' though)

Second, there should be a way to avoid doing an encode/decode on every
syscall. Since I haven't read the Python thread yet I can't say much on
this, but for a while I've had a open-like pragma for this in mind, eg

use syscalls IN => "​:encoding(...)", OUT => "​:encoding(...)";

or

use syscalls :dir => { IN => "​:encoding(...)", OUT => "​:encoding(...)" }

Or somesuch, which won't solve problems in, say, Windows, but hopefully
it
won't make them any worse.

I think it would make things worse, as we would have yet another
non-portable interface that is unusable as a result. In this case it’s
not even portable between Unix systems, because it cannot be used
correctly on Mac OS X, which forces file names on *all* Unix interfaces
to be in UTF-8.

On the other hand we could provide it with lots of caveats in the
documentation. Maybe it could be part of the same pragma.

Um, I'm not sure I follow. Isn't it as portable as the encode/decode calls
that you are forced to use right now? If so yeah, that's pretty bad, but you
can abstract that with something like

use PerlIO​::fse;
use syscalls :all => "​:fse";

Then you could implement unicode​::filenames as a
wrapper around that, and if you want to grab that layer from a locale
setting, that's entirely up to you (just don't ask me to debug it later).

Third, require/use/do. I recall Python having some problems with this (if
the thread that I've neglected reading touches this, I apologize) -- And
actually, I don't know any language that supports it without issues,
though
pointers are of course welcome.
Zefram had a great idea for this a while ago -- If a module has Unicode
in
its path, it should get an alias, reachable through some escaping
scheme or
another. So if I had a module Eeyup​::\x{30cb}​::Bothersome, Bothersome.pm
would be reachable through Eeyup/\x{30cb}/, and, failing that,
unialias/Eeyup/130cb/

Here's the nicest thing -- I implemented 1 and a prototype of 2 in a
couple
of hours, so it's certainly doable, though I haven't touched that in a
while
because I can't figure out a way to test 2 portably.

It sounds like a nice idea at first, but I worry about modules
‘disappearing’ depending on what pragma is enabled.

I was thinking in terms of redefining how the core itself looks for the
modules, that is, change pp_require and friends. If it's implemented as
pragmata, then your worries are spot-on and that could certainly be
troublesome.
More boilerplate for the boilerplate god?

p5pRT commented Oct 24, 2011

From @cpansprout

On Sun Oct 23 21​:00​:09 2011, Hugmeir wrote​:

On Sun, Oct 23, 2011 at 11​:44 PM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

On Sun Oct 23 18​:26​:45 2011, Hugmeir wrote​:
(Please, don’t put -deable at the end of a Latin-based word. :-)
It’s
‘downgradable’.)

But I like my half-broken english..! Fine :P

Please don’t think I’m trying to pick on you. I just see this misuse so
often I thought maybe mentioning it once would give others a hint, too.

Generally, only the consonants c g k m v m z can have -eable after them,
but there are exceptions.

(You don’t know how long I’ve been wanting to bring this up--but now I’m
*way* off topic.)

syswrite seems to be the odd one out. It’s probably using SvPVbyte.
print, die, and warn just warn (i.e., warn chr 256 produces two
warnings). It’s a default warning, though.

That's true, but consider which one of those has the actually useful
behavior. How many times have you gotten a "Wide character" warning
that
left you with mostly worthless output, and had to rerun things by
adding the
layers?

Several hundred. But those were one-time one-liners.

Also, how often do you actually want to pass the internal form of UTF-
8 to
system calls? I'm not saying it can't happen, but it's certainly not
the
common use case. On nearly every other occasion it's a bug that Perl
isn't
reporting, and a warning in this case is twice as useless.

I think we need to warn, for backward-compatibility. I know there have
been times that I relied on UTF-8 interfaces accepting Unicode strings,
without even realising what I was doing. My code worked, after all.
Then module upgrades broke things, but only every tenth time or so that
the code ran, so it remained buggy a long time.

With the new pragma, I would suggest fixing the Unicode bug for
those
functions when the pragma is off (with a warning and fallback). If
that
causes CPAN breakage, then the new behaviour should be enabled with
‘use

I don't think it would cause any more breakage than when the Fcntl
constants subs became actual ()-prototyped constants. The only things
that
"broke" were already broken, but Perl wasn't reporting it.

That’s my thought, but actual smoke reports tend to sway me quickly.

(I'd have little qualms if this were triggered by a 'use VERSION;'
though)

Second, there should be a way to avoid doing an encode/decode on
every
syscall. Since I haven't read the Python thread yet I can't say
much on
this, but for a while I've had a open-like pragma for this in
mind, eg

use syscalls IN => "​:encoding(...)", OUT => "​:encoding(...)";

or

use syscalls :dir => { IN => "​:encoding(...)", OUT =>
"​:encoding(...)" }

Or somesuch, which won't solve problems in, say, Windows, but
hopefully
it
won't make them any worse.

I think it would make things worse, as we would have yet another
non-portable interface that is unusable as a result. In this case
it’s
not even portable between Unix systems, because it cannot be used
correctly on Mac OS X, which forces file names on *all* Unix
interfaces
to be in UTF-8.

On the other hand we could provide it with lots of caveats in the
documentation. Maybe it could be part of the same pragma.

Um, I'm not sure I follow. Isn't it as portable as the encode/decode
calls
that you are forced to use right now? If so yeah, that's pretty bad,
but you
can abstract that with something like

use PerlIO​::fse;
use syscalls :all => "​:fse";

The whole point of the unicode​::filenames pragma is to eliminate the
need to have to specify encodings everywhere, at least as I envision it.
After all, Windows, VMS and Mac OS X all have character sequences for
file names. I think some FreeBSDs might, too, but I’m not sure. So
your explicit encoding suggestion just seems like a can of worms to me,
which will doubtless be misused in CPAN modules by those who don’t
really understand the issues.

Then you could implement unicode::filenames as a wrapper around that, and
if you want to grab that layer from a locale setting, that's entirely up to
you (just don't ask me to debug it later).

Third, require/use/do. I recall Python having some problems with this (if
the thread that I've neglected reading touches this, I apologize) -- And
actually, I don't know any language that supports it without issues, though
pointers are of course welcome.
Zefram had a great idea for this a while ago -- If a module has Unicode in
its path, it should get an alias, reachable through some escaping scheme or
another. So if I had a module Eeyup::\x{30cb}::Bothersome, Bothersome.pm
would be reachable through Eeyup/\x{30cb}/, and, failing that,
unialias/Eeyup/130cb/

Here's the nicest thing -- I implemented 1 and a prototype of 2 in a couple
of hours, so it's certainly doable, though I haven't touched that in a while
because I can't figure out a way to test 2 portably.

It sounds like a nice idea at first, but I worry about modules
‘disappearing’ depending on what pragma is enabled.

I was thinking in terms of redefining how the core itself looks for the
modules, that is, change pp_require and friends. If it's implemented as
pragmata, then your worries are spot-on and that could certainly be
troublesome.

My initial train of thought was a little muddled. In any case, if perl
is to make multiple attempts to load the file, using different methods,
ignoring any pragmata, then that concern is irrelevant. But how many
attempts should perl be making?

If some OSes use Aristotle’s approach, then we only need *two* attempts,
and Zefram’s plan, although it would have been wonderful if 5.8 had
implemented it, will have to be discarded.

There are already people using ‘use Mödule’ on OS X. We shouldn’t break
their code.

More boilerplate for the boilerplate god?

???

@p5pRT
Author

p5pRT commented Oct 24, 2011

From @khwilliamson

On 10/23/2011 10​:25 PM, Father Chrysostomos via RT wrote​:

On Sun Oct 23 21​:00​:09 2011, Hugmeir wrote​:

On Sun, Oct 23, 2011 at 11​:44 PM, Father Chrysostomos via RT<
perlbug-followup@​perl.org> wrote​:

On Sun Oct 23 18​:26​:45 2011, Hugmeir wrote​:
(Please, don’t put -deable at the end of a Latin-based word. :-) It’s
‘downgradable’.)

But I like my half-broken english..! Fine :P

Please don’t think I’m trying to pick on you. I just see this misuse so
often I thought maybe mentioning it once would give others a hint, too.

Generally, only the consonants c g k m v m z can have -eable after them,
but there are exceptions.

(You don’t know how long I’ve been wanting to bring this up--but now I’m
*way* off topic.)

The macro UTF8_IS_DOWNGRADEABLE_START has been in the core since​:
df84a23
Jarkko Hietaniemi <jhi@​iki.fi> Wed, 31 Jan 2001

It's understandable that this spelling has become enshrined as valid.
FWIW, it's never bothered me, a native English speaker.

@p5pRT
Author

p5pRT commented Oct 24, 2011

From @cpansprout

On Mon Oct 24 07​:48​:27 2011, public@​khwilliamson.com wrote​:

On 10/23/2011 10​:25 PM, Father Chrysostomos via RT wrote​:

On Sun Oct 23 21​:00​:09 2011, Hugmeir wrote​:

On Sun, Oct 23, 2011 at 11​:44 PM, Father Chrysostomos via RT<
perlbug-followup@​perl.org> wrote​:

On Sun Oct 23 18​:26​:45 2011, Hugmeir wrote​:
(Please, don’t put -deable at the end of a Latin-based word. :-) It’s
‘downgradable’.)

But I like my half-broken english..! Fine :P

Please don’t think I’m trying to pick on you. I just see this misuse so
often I thought maybe mentioning it once would give others a hint, too.

Generally, only the consonants c g k m v m z can have -eable after them,
but there are exceptions.

(You don’t know how long I’ve been wanting to bring this up--but now I’m
*way* off topic.)

The macro UTF8_IS_DOWNGRADEABLE_START has been in the core since​:
df84a23
Jarkko Hietaniemi <jhi@​iki.fi> Wed, 31 Jan 2001

It's understandable that this spelling has become enshrined as valid.
FWIW, it's never bothered me, a native English speaker.

I’m a native English speaker, too, and it bothers me whenever I see it,
just like ‘referer’.

@p5pRT
Author

p5pRT commented Oct 24, 2011

From tchrist@perl.com

The macro UTF8_IS_DOWNGRADEABLE_START has been in the core since​:
df84a23

It's understandable that this spelling has become enshrined as valid.
FWIW, it's never bothered me, a native English speaker.

I’m a native English speaker, too, and it bothers me whenever I see it,
just like ‘referer’.

Now you know how I feel about “numify”. :(

--tom

@p5pRT
Author

p5pRT commented Oct 24, 2011

From @khwilliamson

On 10/24/2011 09​:37 AM, Tom Christiansen wrote​:

The macro UTF8_IS_DOWNGRADEABLE_START has been in the core since​:
df84a23

It's understandable that this spelling has become enshrined as valid.
FWIW, it's never bothered me, a native English speaker.

I’m a native English speaker, too, and it bothers me whenever I see it,
just like ‘referer’.

Now you know how I feel about “numify”. :(

--tom

numify rhymes (the way I pronounce it) with mummify, which is what
happens when you have some Académie dictating what goes into a language
and what doesn't.

My grandmother (born 1885, raised on a Wisconsin farm) hated the term
'kid' when applied to a human child instead of a goat. I found that
surprising, and when I look it up just now, I see her meaning down the
list, and the 'human' meaning at the top.

I cringe when I hear 'less' when the 'proper' term is 'fewer'. I
recently had occasion to use 'pluralize'; I cringed every time I wrote
it, but it got the job done.

We are powerless over the vicissitudes of English, whose polyglot
mutations are, I believe, a major reason why it has supplanted French as
the required international language that everyone has to learn.

Vive le sandwich!

@p5pRT
Author

p5pRT commented Oct 24, 2011

From tchrist@perl.com

My grandmother (born 1885, raised on a Wisconsin farm)

How odd​: so was mine. 1919-2010.

--tom

@p5pRT
Author

p5pRT commented Oct 24, 2011

From @ikegami

On Sun, Oct 23, 2011 at 9​:26 PM, Brian Fraser <fraserbn@​gmail.com> wrote​:

Second, there should be a way to avoid doing an encode/decode on every
syscall. Since I haven't read the Python thread yet I can't say much on
this, but for a while I've had an open-like pragma for this in mind, eg

use syscalls IN => ":encoding(...)", OUT => ":encoding(...)";

or

use syscalls :dir => { IN => ":encoding(...)", OUT => ":encoding(...)" }

When does it make sense to use two different encodings?

Are you saying that non-Windows system can't tell you which encoding it is
using?

@p5pRT
Author

p5pRT commented Oct 24, 2011

From @Leont

On Mon, Oct 24, 2011 at 10​:07 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

Are you saying that non-Windows system can't tell you which encoding it is
using?

Most unices (pretty much all of them except OS X) do not have an
inherent encoding at all. Filenames are blobs.

Leon
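A minimal sketch of what "blobs" means for a Perl caller: a name returned by readdir may or may not be valid in whatever encoding you assume, so a careful reader can attempt a strict decode and keep the raw bytes when it fails. The helper name try_decode_name is hypothetical, and UTF-8 is only an assumed convention here, not something the kernel guarantees.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Returns (characters, 1) when $bytes is valid UTF-8,
# or (the original bytes, 0) when it is not.
sub try_decode_name {
    my ($bytes) = @_;
    my $copy  = $bytes;    # decode() with a CHECK value may modify its input
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return defined $chars ? ($chars, 1) : ($bytes, 0);
}

# "\xCF\x83" is the UTF-8 encoding of U+03C3 (Greek small sigma).
my ($good_name, $good_ok) = try_decode_name("\xCF\x83");

# "caf\xE9" is a Latin-1 spelling of "café"; it is not valid UTF-8.
my ($raw_name, $raw_ok) = try_decode_name("caf\xE9");
```

Keeping the undecodable bytes around matters because they are the only handle that can still open, rename or unlink that file.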

@p5pRT
Author

p5pRT commented Oct 24, 2011

From tchrist@perl.com

Karl, it isn't about shifting word-use. That's a red herring.
Rather, it's about either rank ignorance or willful disregard
of the phonologic–orthographic texture of the *written* language.

That is not the way English has ever worked before in any existing
precedent. Mummify, mummification are the precedent you're looking
for here, *not* numen, numina, numinal, numinous, numinosity.

And somebody goofed. That doesn't make it right, or good.

It's just like children who get catachrestically named
Marybeht because their parents didn't know that you spell
the theta sound with a th in English, not with an ht.

Sure, you can do it. You can do anything. But it looks
stupid and it saddles the poor thing with a lifelong curse.

See also HTTP_REFERER.

--tom

@p5pRT
Author

p5pRT commented Oct 24, 2011

From @ikegami

On Mon, Oct 24, 2011 at 4​:12 PM, Leon Timmermans <fawaka@​gmail.com> wrote​:

On Mon, Oct 24, 2011 at 10​:07 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

Are you saying that non-Windows system can't tell you which encoding it is
using?

Most unices (pretty much all of them except OS X) do not have an
inherent encoding at all. Filenames are blobs.

Then how come I can read the file names in file selection dialogs on this
Debian box?

@p5pRT
Author

p5pRT commented Oct 24, 2011

From @ilmari

Eric Brine <ikegami@​adaelis.com> writes​:

On Mon, Oct 24, 2011 at 4​:12 PM, Leon Timmermans <fawaka@​gmail.com> wrote​:

On Mon, Oct 24, 2011 at 10​:07 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

Are you saying that non-Windows system can't tell you which encoding it is
using?

Most unices (pretty much all of them except OS X) do not have an
inherent encoding at all. Filenames are blobs.

Then how come I can read the file names in file selection dialogs on this
Debian box?

Because the toolkit assumes an encoding, usually UTF-8. See
<http​://www.gtk.org/api/2.6/glib/glib-Character-Set-Conversion.html#g-get-filename-charsets>
for how GTK+ determines it.
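
As a rough illustration of that lookup, here is a sketch in Perl of the order the GLib documentation describes, as far as I understand it: G_FILENAME_ENCODING (a comma-separated list, with "@locale" standing for the locale's character set) is consulted first, then G_BROKEN_FILENAMES, and UTF-8 is the fallback. The function name filename_charsets is hypothetical, and the sketch simplifies the final case, where GLib also appends the locale's character set to the list.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# $env is a hash ref of environment variables; $locale_charset is the
# character set of the current locale, passed in so the rule is testable.
sub filename_charsets {
    my ($env, $locale_charset) = @_;
    my $list = $env->{G_FILENAME_ENCODING};
    if (defined $list && length $list) {
        # "@locale" is a placeholder for the locale's own character set
        return map { $_ eq '@locale' ? $locale_charset : $_ } split /,/, $list;
    }
    return ($locale_charset) if defined $env->{G_BROKEN_FILENAMES};
    return ('UTF-8');    # simplified: GLib also lists the locale charset here
}

# Usage: query the locale's character set, then apply the rule to %ENV.
setlocale(LC_CTYPE, '');
my @charsets = filename_charsets(\%ENV, langinfo(CODESET));
print "Assumed filename charsets: @charsets\n";
```

The point is that the encoding comes from convention and environment, not from the filesystem itself.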

--
ilmari
"A disappointingly low fraction of the human race is,
at any given time, on fire." - Stig Sandbeck Mathisen

@p5pRT
Author

p5pRT commented Oct 25, 2011

From @ap

* Tom Christiansen <tchrist@​perl.com> [2011-10-24 22​:35]​:

Sure, you can do it. You can do anything. But it looks stupid and it
saddles the poor thing with a lifelong curse.

See also HTTP_REFERER.

creat

@p5pRT
Author

p5pRT commented Oct 25, 2011

From tchrist@perl.com

* Tom Christiansen <tchrist@​perl.com> [2011-10-24 22​:35]​:

Sure, you can do it. You can do anything. But it looks stupid and it
saddles the poor thing with a lifelong curse.

See also HTTP_REFERER.

creat

creat was not caused by not knowing how to spell the word create.

But you're right that it is something its inventors came
to regret having done.

--tom

@p5pRT
Author

p5pRT commented Nov 10, 2011

From @Hugmeir

On Mon, Oct 24, 2011 at 1​:25 AM, Father Chrysostomos via RT <
perlbug-followup@​perl.org> wrote​:

That's true, but consider which one of those has the actually useful
behavior. How many times have you gotten a "Wide character" warning that
left you with mostly worthless output, and had to rerun things by adding the
layers?

Several hundred. But those were one-time one-liners.

Also, how often do you actually want to pass the internal form of UTF-8 to
system calls? I'm not saying it can't happen, but it's certainly not the
common use case. On nearly every other occasion it's a bug that Perl isn't
reporting, and a warning in this case is twice as useless.

I think we need to warn, for backward-compatibility. I know there have
been times that I relied on UTF-8 interfaces accepting Unicode strings,
without even realising what I was doing. My code worked, after all.
Then module upgrades broke things, but only every tenth time or so that
the code ran, so it remained buggy a long time.

I don't think it would cause any more breakage than when the Fcntl
constant subs became actual ()-prototyped constants. The only things that
"broke" were already broken, but Perl wasn't reporting it.

That’s my thought, but actual smoke reports tend to sway me quickly.

Actually, how about a CPAN smoke of this? If the extent of the breakage is
reasonable, I'll personally send patches to all the affected modules : )
And as an added bonus, even if the core doesn't change to croak, it'll
improve the overall robustness of CPAN!
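
To make the "croak" option concrete, here is a minimal user-level sketch of the check under discussion. It is not how the core does or would do it; assert_bytes is a hypothetical helper that refuses any name containing a character above 0xFF, which is what a file name must not contain if it is to be passed to a syscall as bytes.

```perl
use strict;
use warnings;

# Returns the byte form of $name, or dies if $name holds a wide character.
sub assert_bytes {
    my ($name) = @_;
    my $copy = $name;
    # With a true second argument, utf8::downgrade returns false on
    # failure instead of dying, so we can supply our own message.
    utf8::downgrade($copy, 1)
        or die "Wide character in file name; encode it first\n";
    return $copy;
}

my $encoded = "\xCF\x83";    # UTF-8 bytes for U+03C3: already encoded
my $wide    = "\x{3c3}";     # the unencoded character itself

print "bytes ok\n" if assert_bytes($encoded) eq $encoded;
print "rejected: $@" unless eval { assert_bytes($wide); 1 };
```

A CPAN smoke with a check like this wrapped around open, opendir and friends would show how much code currently relies on wide characters leaking through.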

The whole point of the unicode​::filenames pragma is to eliminate the
need to have to specify encodings everywhere, at least as I envision it.
After all, Windows, VMS and Mac OS X all have character sequences for
file names. I think some FreeBSDs might, too, but I’m not sure. So
your explicit encoding suggestion just seems like a can of worms to me,
which will doubtless be misused in CPAN modules by those who don’t
really understand the issues.

Hm.. That's true enough. I was a bit wary of something automatically
picking the fs encoding for me, but then I noticed that the most common use
case of a pragma that had you explicitly set the encodings would be to load
a module to do exactly that! (e.g. the PerlIO​::fse example in my previous
mail). Having that as the default seems reasonable.
Though it would be swell if it provided a way to override those defaults.

(Would you consider calling it unicode​::syscalls or somesuch? :​:filenames
implies it wouldn't affect, say, qx//)

My initial train of thought was a little muddled. In any case, if perl
is to make multiple attempts to load the file, using different methods,
ignoring any pragmata, then that concern is irrelevant. But how many
attempts should perl be making?

If some OSes use Aristotle’s approach, then we only need *two* attempts,
and Zefram’s plan, although it would have been wonderful if 5.8 had
implemented it, will have to be discarded.

Yeah, you are right. I don't think I fully understand Aristotle's proposal
(though many thanks to him for taking time to explain it to me on IRC), but
it seems pretty good. Now someone just has to write it : )

There are already people using ‘use Mödule’ on OS X. We shouldn’t break
their code.

That probably won't work for the latin-1 range though, and the lack of
normalization on our side, while the OS does it, is and will be
troublesome. But personally, I was thinking of exempting use/require/do for
the time being, for two main reasons; first, properly overriding/encoding
those is non-trivial, and second, it's not an issue that should matter to
people writing Perl; How perl finds its stuff should concern only (mostly)
perl.
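
A small sketch of the normalization mismatch, assuming the filesystem stores names in a decomposed form (as HFS+ does, in its own variant of NFD) while the source code spells the module name with a precomposed character. The variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFD);

my $typed  = "M\x{f6}dule";    # "Mödule" with precomposed U+00F6
my $stored = NFD($typed);      # "o" followed by combining U+0308

# The two spellings look the same but differ as strings and as bytes.
my $typed_bytes  = encode('UTF-8', $typed);
my $stored_bytes = encode('UTF-8', $stored);

printf "typed:  %d chars, %d bytes\n", length $typed,  length $typed_bytes;
printf "stored: %d chars, %d bytes\n", length $stored, length $stored_bytes;
print  "equal after NFC: ", (NFC($stored) eq $typed ? "yes" : "no"), "\n";
```

Any comparison between a name from the source and a name from readdir would need to normalize both sides first, or it will miss matches like this one.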

More boilerplate for the boilerplate god?

???

Sorry, in-joke.

@p5pRT p5pRT added the Unicode and System Calls Bad interactions of syscalls and UTF-8 label Nov 15, 2019
@xenu xenu removed the Severity Low label Dec 29, 2021