rename, chroot etc. ignore internal encoding #10623

Open
p5pRT opened this issue Sep 12, 2010 · 14 comments

@p5pRT

p5pRT commented Sep 12, 2010

Migrated from rt.perl.org#77798 (status was 'open')

Searchable as RT77798$

@p5pRT
Author

p5pRT commented Sep 12, 2010

From perlbug@plan9.de

Created by perlbug@plan9.de

This snippet calls rename with two different paths, even though the same
string is passed to rename.

  perl -e 'my $x = chr 200; rename $x,0; utf8::encode $x; rename $x,0'

The fact that the internal (basically invisible to a perl program)
encoding changes should not change semantics of I/O functions.

The solution is to use the equivalent of SvPVbyte, not SvPV, when passing
paths (or other 8-bit data) to POSIX functions.

A cursory examination of pp_sys shows that at least backtick, open,
dbmopen, sysopen, truncate, bind, setsockopt, getsockopt, getpeername,
stat, chdir, chroot, link, readlink, mkdir, rmdir, opendir, system, exec,
gethost*, getproto*, getserv* etc. are affected (I stopped looking).

All those functions silently throw away the crucial information of how
bytes are encoded in a string. As modules and programs using unicode
become more common, this problem will become a major issue.

(When in doubt, it always helps to review the discussion about crypt()
which was fixed during 5.006 times).

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.10.1:

Configured by Marc Lehmann at Wed May  5 10:53:04 CEST 2010.

Summary of my perl5 (revision 5 version 10 subversion 1) configuration:
   
  Platform:
    osname=linux, osvers=2.6.26-2-amd64, archname=amd64-linux
    uname='linux cerebro 2.6.26-2-amd64 #1 smp thu nov 5 02:23:12 utc 2009 x86_64 gnulinux '
    config_args='-Duselargefiles -Dxxxxuse64bitint -Uuse64bitall -Dusemymalloc=n -Dcc=gcc -Dccflags=-ggdb -gdwarf-2 -g3 -Dcppflags=-DPERL_ARENA_SIZE=65536 -D_GNU_SOURCE -I/opt/include -Doptimize=-O6 -funroll-loops -fno-strict-aliasing -Dcccdlflags=-fPIC -Dldflags=-L/opt/perl/lib -L/opt/lib -Dlibs=-ldl -lm -lcrypt -Darchname=amd64-linux -Dprefix=/opt/perl -Dprivlib=/opt/perl/lib/perl5 -Darchlib=/opt/perl/lib/perl5 -Dvendorprefix=/opt/perl -Dvendorlib=/opt/perl/lib/perl5 -Dvendorarch=/opt/perl/lib/perl5 -Dsiteprefix=/opt/perl -Dsitelib=/opt/perl/lib/perl5 -Dsitearch=/opt/perl/lib/perl5 -Dsitebin=/opt/perl/bin -Dman1dir=/opt/perl/man/man1 -Dman3dir=/opt/perl/man/man3 -Dsiteman1dir=/opt/perl/man/man1 -Dsiteman3dir=/opt/perl/man/man3 -Dman1ext=1 -Dman3ext=3 -Dpager=/usr/bin/less -Uafs -Uusesfio -Uusenm -Uuseshrplib -Ud_dosuid -Dusethreads=undef -Duse5005threads=undef -Duseithreads=undef -Dusemultiplicity=undef -Demail=perl-binary@plan9.de -Dcf_email=perl-binary@plan9.de -Dcf_by=Marc Lehmann -Dlocincpth=/opt/perl/include /opt/include -Dmyhostname=localhost -Dmultiarch=undef -Dbin=/opt/perl/bin -Dxxxusedevel -DxxxDEBUGGING -Dxxxuse_debugging_perl -Dxxxuse_debugmalloc -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-ggdb -gdwarf-2 -g3 -fno-strict-aliasing -pipe -fstack-protector -I/opt/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O6 -funroll-loops -fno-strict-aliasing',
    cppflags='-DPERL_ARENA_SIZE=65536 -D_GNU_SOURCE -I/opt/include -ggdb -gdwarf-2 -g3 -fno-strict-aliasing -pipe -fstack-protector -I/opt/include'
    ccversion='', gccversion='4.3.2', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags ='-L/opt/perl/lib -L/opt/lib -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
    libs=-ldl -lm -lcrypt
    perllibs=-ldl -lm -lcrypt
    libc=/lib/libc-2.10.2.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.10.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O6 -funroll-loops -fno-strict-aliasing -L/opt/perl/lib -L/opt/lib -L/usr/local/lib -fstack-protector'

Locally applied patches:
    


@INC for perl 5.10.1:
    /root/src/sex
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    .


Environment for perl 5.10.1:
    HOME=/root
    LANG (unset)
    LANGUAGE (unset)
    LC_CTYPE=en_US.UTF-8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/root/s2:/root/s:/opt/bin:/opt/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/X11/bin:/usr/games:/usr/local/bin:/usr/local/sbin:/root/pserv:.
    PERL5LIB=/root/src/sex
    PERL5_CPANPLUS_CONFIG=/root/.cpanplus/config
    PERLDB_OPTS=ornaments=0
    PERL_ANYEVENT_DBI_TESTS=1
    PERL_ANYEVENT_EDNS0=1
    PERL_ANYEVENT_NET_TESTS=1
    PERL_ANYEVENT_PROTOCOLS=ipv4,ipv6
    PERL_ANYEVENT_STRICT=1
    PERL_BADLANG (unset)
    PERL_UNICODE=E
    SHELL=/bin/bash

@p5pRT
Author

p5pRT commented Sep 12, 2010

From @ikegami

On Sun, Sep 12, 2010 at 5:15 AM, perlbug@plan9.de <perlbug-followup@perl.org> wrote:

# New Ticket Created by perlbug@plan9.de
# Please include the string: [perl #77798]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=77798 >

This is a bug report for perl from perlbug@plan9.de,
generated with the help of perlbug 1.39 running under perl 5.10.1.

-----------------------------------------------------------------
[Please describe your issue here]

This snippet calls rename with two different paths, even though the same
string is passed to rename.

perl -e 'my $x = chr 200; rename $x,0; utf8::encode $x; rename $x,0'

$x and $x after utf8::encode($x) are not the same string. (They're not even
the same length.)

But there is a bug here. $x after utf8::upgrade and $x after utf8::downgrade
are the same string, but they're not treated as such.

$ perl -e'$_=chr(0xE9); utf8::upgrade($_); rename "a",$_'
$ perl -e'$_=chr(0xE9); utf8::downgrade($_); rename "b",$_'
$ ls
é
?

The solution is to use the equivalent of SvPVbyte, not SvPV, when passing

Correct.

@p5pRT
Author

p5pRT commented Sep 12, 2010

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Author

p5pRT commented Dec 11, 2010

From schmorp@schmorp.de

On Sun, Sep 12, 2010 at 01:23:42PM -0400, Eric Brine <ikegami@adaelis.com> wrote:

Sorry for the late reply, but, again, I never received your mail because
it wasn't directed at me, so I just saw it "by accident" by looking at
p5p.

This snippet calls rename with two different paths, even though the same
string is passed to rename.

perl -e 'my $x = chr 200; rename $x,0; utf8::encode $x; rename $x,0'

$x and $x after utf8::encode($x) are not the same string. (They're not even
the same length.)

Yes, while condensing the testcase as much as possible I accidentally
swapped upgrade with encode. In any case, the problem remains the same,
namely perl ignoring the utf-8 flag for many of its system interfaces,
and the ExtUtils typemap, which breaks many xs modules.

--
  The choice of a Deliantra, the free code+content MORPG
  -----==- _GNU_ http://www.deliantra.net
  ----==-- _ generation
  ---==---(_)__ __ ____ __ Marc Lehmann
  --==---/ / _ \/ // /\ \/ / schmorp@schmorp.de
  -=====/_/_//_/\_,_/ /_/\_\

@FGasper
Contributor

FGasper commented Feb 26, 2021

What if this were solved by creating a sysbinmode built-in that served the same purpose as binmode for filehandles?

That way Perl applications could set an I/O layer for rename et al. And Perl’s default would change to the same behaviour as filehandles—basically SvPVbyte.

@Leont
Contributor

Leont commented Feb 26, 2021

What if this were solved by creating a sysbinmode built-in that served the same purpose as binmode for filehandles?

What scope would that have?

@FGasper
Contributor

FGasper commented Feb 27, 2021

What scope would that have?

Global, I guess? Could alternatively make it a pragma, e.g., use sysbinmode "UTF-8".

@Leont
Contributor

Leont commented Feb 28, 2021

I think global would be wrong, because that means code can't make any assumptions of its own anymore. I immediately recall PHP code full of "if add_slashes is globally enabled do this, otherwise do that" code.

@FGasper
Contributor

FGasper commented Feb 28, 2021

@Leont Global, yes, feels wrong.

But if I could:

use sysbinmode ':utf8';

my $foo = "é";
exec 'echo', $foo;

… and have that auto-encode the same way binmode $fh, ':utf8' does, that would seem a reasonable fix?

@xenu
Member

xenu commented Feb 28, 2021

See also #17094 (comment) (the ticket is about win32, but tony's proposal is for all platforms).

@FGasper
Contributor

FGasper commented Feb 28, 2021

@xenu For myself, I actually want to go the other way: SvPVbyte rather than SvPVutf8.

@FGasper
Contributor

FGasper commented Feb 28, 2021

@Leont @xenu ^^ Thoughts on the above proof-of-concept?

@ikegami
Contributor

ikegami commented Feb 28, 2021

On unix systems, file names are composed of arbitrary bytes, with two having specific values: 0x00 reserved to denote end of string, and 0x2F directory separator. ("/" is 0x2F in EBCDIC encodings too!) There's no guarantee of being UTF-8 or some other encoding, no matter what the locale says.

On Windows file systems, file names are sequences of arbitrary 16-bit values expected to be UTF-16le, but it's surely possible to have unmatched surrogates and invalid characters such as 0xFFFF.

If we want Perl to be able to round-trip any file name (e.g. readdir -> rename), there are two options.

  1. Return/accept arbitrary sequences of 8-bit values (unix) or 16-bit values (Windows), no matter how they are stored (upgraded or downgraded).

  2. Decode/encode returned/accepted file names (using locale in unix) in such a way that any sequence can be created. See this for an example of such a system.

Current status:

  • Unix: Returns/accepts arbitrary sequences of 8-bit values. This means that any file name can be round-tripped. However, what gets passed is the internal storage of the string rather than the string itself.

  • Windows: Returns/accepts the file name encoded using the system's Active/ANSI Code Page. Most file names can't be returned or accepted by Perl (without using modules instead of builtin functions).

    I'm trying to get the ACP for Perl's process changed to 65001 (UTF-8) for Strawberry Perl. See the issue I raised.

use sysbinmode ':utf8';

I would love to see decoded file names (option 2 above), and a pragma would be required to do so, but having to provide an encoding is bad. The correct encoding should be used. The pragma could allow one to specify errors.

  • encode the error in a way that it can be recreated (Would allow file names to be round-tripped.)
  • replace silently (Safer, but would prevent round-tripping.)
  • replace noisily (ditto)
  • die
  • callback

@FGasper
Contributor

FGasper commented Mar 4, 2021

This deals with the problem of upgraded/downgraded strings meaning different filesystem paths:
https://metacpan.org/pod/Sys::Binmode

It doesn’t address Windows, but AFAIK it doesn’t worsen the Windows situation, either.
