open() is not UTF-8-clean #9674

Open
p5pRT opened this issue Mar 6, 2009 · 8 comments
p5pRT commented Mar 6, 2009

Migrated from rt.perl.org#63674 (status was 'open')

Searchable as RT63674$

p5pRT commented Mar 6, 2009

From zefram@fysh.org

Created by zefram@fysh.org

$ perl -lwe '$a="\x{e3}"; utf8::downgrade($a); open(my $x, ">", "x$a"); utf8::upgrade($a); open(my $y, ">", "y$a"); opendir(my $d, "."); while(defined($_ = readdir($d))) { print unpack("H*", $_) unless /\A[ -~]*\z/ }'
78e3
79c3a3
$

Apparently open() is using, for the filename, the octet sequence used
to represent the string internally, rather than the character sequence
that the string actually represents. This is a common problem with
XS modules; I'm a bit surprised to see the core get it wrong too.
(Not *very* surprised, though, because the way the SvUTF8 flag was
injected invites this sort of mistake.)
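
The distinction drawn above, between the characters a string represents and the octets Perl happens to store for it, can be seen without touching the filesystem. The following is a minimal sketch, not part of the original report; the `describe` helper is hypothetical, while `utf8::downgrade`, `utf8::upgrade`, `utf8::is_utf8` and `bytes::length` are standard core functions.

```perl
use strict;
use warnings;
require bytes;    # only for bytes::length(); does not enable the bytes pragma

# Two copies of the same one-character string (U+00E3).
my $down = "\x{e3}";
utf8::downgrade($down);    # internal form: one octet, E3
my $up = "\x{e3}";
utf8::upgrade($up);        # internal form: two octets, C3 A3

# Hypothetical helper: report the character length, the internal octet
# length, and the state of the SvUTF8 flag for a string.
sub describe {
    my ($label, $s) = @_;
    return sprintf "%s: chars=%d internal_octets=%d utf8_flag=%d",
        $label, length($s), bytes::length($s), utf8::is_utf8($s) ? 1 : 0;
}

print "eq: ", ($down eq $up ? "yes" : "no"), "\n";
print describe("downgraded", $down), "\n";
print describe("upgraded",   $up),   "\n";
```

The two strings compare equal and have the same character length, yet their internal octet lengths differ (1 versus 2), which matches the two filenames `78e3` and `79c3a3` in the report.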

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.10.0:

Configured by Debian Project at Thu Jan  1 12:43:38 UTC 2009.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.26-1-686, archname=i486-linux-gnu-thread-multi
    uname='linux rebekka 2.6.26-1-686 #1 smp mon dec 15 18:15:07 utc 2008 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.0 -Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.3.2', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.7.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'

Locally applied patches:
    


@INC for perl 5.10.0:
    /etc/perl
    /usr/local/lib/perl/5.10.0
    /usr/local/share/perl/5.10.0
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .


Environment for perl 5.10.0:
    HOME=/home/zefram
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/zefram/pub/i686-pc-linux-gnu/bin:/home/zefram/pub/common/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/local/bin:/usr/games
    PERL_BADLANG (unset)
    SHELL=/usr/bin/zsh

p5pRT commented Mar 12, 2009

From lists@der-pepe.de

On Fri, Mar 06, 2009 at 03:03:14AM -0800, Zefram wrote:

> Apparently open() is using, for the filename, the octet sequence used
> to represent the string internally, rather than the character sequence
> that the string actually represents.

It's kind of documented in perlunicode, section "When Unicode Does Not
Happen". I think the problem is that on most systems, there is no
universal standard about which encodings file names should have.

Regards,
Christoph

p5pRT commented Mar 12, 2009

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Mar 12, 2009

From zefram@fysh.org

Christoph Bussenius wrote:

> It's kind of documented in perlunicode, section "When Unicode Does Not
> Happen".

It seems to be documented from the point of view of perl 5.6, where
the UTF8 flag existed but didn't really have any semantics. Doesn't
really make sense for 5.8 onwards. "For all of these interfaces Perl
... assumes byte strings both as arguments and results ...". I never
really understood the common usage, in situations like this, of the
word "assume". You can assume that I've passed you a byte string,
but if that isn't actually true then you're going to run into trouble,
of a nature that the document has completely failed to specify.

But in this case, in fact, I *am* supplying you with a byte string,
as far as the Perl language is concerned. The same byte string twice,
but you're treating it differently depending on what should be an
insignificant implementation detail. Assumption of something that's
true doesn't really cover the wrongitude that's occurring here.

> I think the problem is that on most systems, there is no
> universal standard about which encodings file names should have.

Not at all. Differing standards about encodings to be used with filenames
are a reason to have pluggable encoding layers on filesystems, along the
lines of the existing pluggable encoding layers on I/O within a file.
If my filenames were consistently UTF-8 mangled then, well, I'd probably
still complain, but it would be a complaint about unwanted UTF-8 encoding.
This complaint is about *inconsistent* behaviour. The same input is being
subjected to different encoding based on something I should never see.

-zefram
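
One way to get the consistency asked for above, pending any change in the core, is to encode the filename explicitly before it reaches open(). This is a sketch under assumptions, not something proposed in the thread: `fs_name` is a hypothetical helper, and UTF-8 is an arbitrary choice standing in for whatever encoding the filesystem in question expects. `Encode::encode` is the standard core function.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a character string into the octets that would be
# handed to open(). UTF-8 is an assumption; a real program would choose the
# encoding its filesystem expects.
sub fs_name {
    my ($chars) = @_;
    return encode("UTF-8", $chars);
}

# The same two-character string in both internal representations.
my $down = "x\x{e3}"; utf8::downgrade($down);
my $up   = "x\x{e3}"; utf8::upgrade($up);

print unpack("H*", fs_name($down)), "\n";
print unpack("H*", fs_name($up)),   "\n";
# both lines print 78c3a3
```

Because the encoding step works on characters, the result no longer depends on the internal flag: both representations yield the same octets.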

p5pRT commented Mar 12, 2009

From tchrist@perl.com

For a real fun time (NOT!), try this on various operating systems
and various filesystems. And remind me how to turn off case
insensitivity on Darwin, darn it. :-)

Some things I understand, others are very mysterious, like

  Use of uninitialized value $count in printf at mkfiles line 77.

Yet it clearly prints the correct file number. So odd.

Make sure to run this on both a non utf8 aware f/s and also
on one that it is.

Remarkable!

--tom

#!/usr/bin/perl
#
# mk_utf_files - try to make "synonymous" utf8 files
#
# Tom Christiansen
# Thu Mar 12 02:56:03 MDT 2009
#

BEGIN {
  binmode(STDOUT, ":utf8") || die "can't binmode STDOUT to utf8: $!";
  binmode(STDERR, ":utf8") || die "can't binmode STDERR to utf8: $!";
  $| = 1;
}

use strict;
use warnings;
#use warnings FATAL => qw[ all ];

use Fcntl qw [ O_WRONLY O_CREAT O_EXCL O_EXLOCK ];

my $oflags = ( O_WRONLY|O_CREAT|O_EXCL|O_EXLOCK );

my @emigres = (
  [ qw( E9 6D 69 67 72 E9 ) ],
  [ qw( E9 6D 69 67 72 65 301 ) ],
  [ qw( 65 301 6D 69 67 72 E9 ) ],
  [ qw( 65 301 6D 69 67 72 65 301 ) ],
);

for my $ary_ref (@emigres) {
  for (@$ary_ref) {
    $_ = hex($_);
  }
}

my @fnames;

$fnames[0] = pack("C*", @{ $emigres[0] } );
for my $i (0 .. $#emigres) {
  $fnames[$i+1] = pack("U*", @{ $emigres[$i] } );
}

my $len = @fnames;

push @fnames => (
  (map { ucfirst } @fnames),
  (map { uc } @fnames),
);

my $i = 0;
for my $name (@fnames) {
  my $am_utf8 = ($i % $len) != 0;

  print "\n------\n" unless $am_utf8;

  my $fmt = $am_utf8 ? "C0C*" : "U0U*";
  my $enc = $am_utf8 ? "utf8" : "encoding(latin1)";

  emit($name, $i);

  ### next;

  unless ( sysopen(FH, $name, $oflags, 0666) ) {
    warn sprintf("couldn't CREAT file %d: %s %vX: %s [%d]\n", $i, $name, $name, $!, $!);
    next;
  }

  binmode(FH, ":$enc") || die "couldn't binmode $name to $enc: $!";

  select(FH);
  $| = 1;
  emit($name);
  close(FH)
    || warn sprintf("couldn't close %s %vX: %s [%d]\n", $name, $name, $!, $!);

} continue {
  select(STDOUT);
  $i++;
}

sub emit {
  my ($string, $count) = @_;
  printf "%s file %d\n\tChars = %vX\n", $string, $count, $string;
  {
    use bytes;
    printf "\tBytes = %vX\n", $string, $string;
  }
}

p5pRT commented Mar 12, 2009

From lists@der-pepe.de

On Thu, Mar 12, 2009 at 03:03:01AM -0600, Tom Christiansen wrote:

> Some things I understand, others are very mysterious, like
>
>   Use of uninitialized value $count in printf at mkfiles line 77.
>
> Yet it clearly prints the correct file number. So odd.

I think I can answer this part of your mail:

The file number is being printed correctly because of line 60,

  emit($name, $i);

However the warning stems from line 73,

  emit($name);

which happens after select(FH), which is why the missing number cannot
be seen on the terminal.

Regards,
Christoph
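
The effect described above can be reproduced in isolation. The following is a minimal sketch, assuming a cut-down `emit` and an in-memory filehandle in place of the script's `FH`; the warning is captured with a `__WARN__` handler only so that it can be counted.

```perl
use strict;
use warnings;

# Same shape as the script's emit(), reduced to the one printf that matters.
sub emit {
    my ($string, $count) = @_;
    printf "%s file %d\n", $string, $count;
}

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

open(my $fh, ">", \my $buffer) or die "can't open in-memory handle: $!";
my $previous = select($fh);   # printf without a handle now writes to $fh
emit("name");                 # $count is undef: warns, and %d formats it as 0
select($previous);
close($fh) or die "can't close: $!";

print "written to the selected handle: $buffer";
print "warnings captured: ", scalar(@warnings), "\n";
```

The line with the missing number goes to the selected handle (in the script, the newly created file), while the warning goes to STDERR, so only the warning shows up on the terminal.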

p5pRT commented Dec 25, 2012

From victor@vsespb.ru

This code

perl -lwe '$a="\x{e3}"; utf8::downgrade($a); open(my $x, ">", "x$a"); utf8::upgrade($a); open(my $y, ">", "y$a"); opendir(my $d, "."); while(defined($_ = readdir($d))) { print unpack("H*", $_) unless /\A[ -~]*\z/ }'

actually sends different octets to open().
See here:

perl -MDevel::Peek -lwe '$a="\x{e3}"; utf8::downgrade($a); print Dump("x$a"); utf8::upgrade($a); print Dump("y$a");'

SV = PV(0x1b45c68) at 0x1b694e8
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK)
  PV = 0x1b5fad0 "x\343"\0
  CUR = 2
  LEN = 8

SV = PV(0x1b45b58) at 0x1b730a0
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK,UTF8)
  PV = 0x1b63b90 "y\303\243"\0 [UTF8 "y\x{e3}"]
  CUR = 3
  LEN = 8

p5pRT commented Dec 25, 2012

From victor@vsespb.ru

I have not checked, but it looks to me like this script

mk_utf_files - try to make "synonymous" utf8 files

creates filenames in different normalization forms (NFC/NFD),
so it is not really related to this bug.

@toddr toddr removed the khw label Oct 25, 2019
@p5pRT p5pRT added the Unicode and System Calls Bad interactions of syscalls and UTF-8 label Nov 15, 2019
nschloe pushed a commit to live-clones/lintian that referenced this issue Apr 15, 2020
…ile::Path. (See: #956233, #956723)

This provides relief from runtime errors in Lintian, but does not
solve the bugs. It merely makes Lintian useable again.

The offending packages sphinx and supysonic no longer abort with
runtime errors.

Due to a bug in Perl, strings must be "downgraded" before system calls
such as stat or open. It is the proper fix [1][2], and should happen
in Perl. We simply do so here as triage.

[1] Perl/perl5#10550
[2] Perl/perl5#9674

More comprehensive fixes for both bugs are in the works.
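
What "downgrading" a string before a system call might look like is sketched below. This is an illustration only and not Lintian's actual code: `downgraded_path` is a hypothetical helper, built on the core function `utf8::downgrade` with its FAIL_OK argument set so that strings containing characters above U+00FF return false instead of dying.

```perl
use strict;
use warnings;

# Hypothetical helper, not Lintian's actual code: return a downgraded copy of
# $path, or undef if it contains characters above U+00FF and so cannot be
# downgraded. The caller's string is left untouched.
sub downgraded_path {
    my ($path) = @_;
    return utf8::downgrade($path, 1) ? $path : undef;   # 1 = FAIL_OK, do not die
}

my $upgraded = "y\x{e3}";
utf8::upgrade($upgraded);

my $safe = downgraded_path($upgraded);
print "flag before: ", (utf8::is_utf8($upgraded) ? 1 : 0), "\n";
print "flag after:  ", (utf8::is_utf8($safe) ? 1 : 0), "\n";
print "wide string: ",
    (defined downgraded_path("e\x{301}") ? "ok" : "cannot downgrade"), "\n";
```

A downgraded string reaches open() or stat as one octet per character; a string that cannot be downgraded needs an explicit encoding step instead.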
@xenu xenu removed the affects-5.10 label Nov 19, 2021