-e ignores the UTF8 flag #10550

Open
p5pRT opened this issue Aug 15, 2010 · 13 comments

@p5pRT

p5pRT commented Aug 15, 2010

Migrated from rt.perl.org#77242 (status was 'open')

Searchable as RT77242$

@p5pRT
Author

p5pRT commented Aug 15, 2010

From @cpansprout

#!perl -l
$fn = "\xe2\x80\x99";
`touch \Q$fn`;
print -e $fn; # prints 1
print -e substr "$fn\x{100}", -1; # prints nothing

This can happen easily if the file name comes from a URI (URIs always represent byte sequences) and the URI happened to come from a UTF-8 web page.


Flags:
  category=core
  severity=medium


Site configuration information for perl 5.13.3:

Configured by sprout at Thu Aug 12 17:53:37 PDT 2010.

Summary of my perl5 (revision 5 version 13 subversion 3 patch v5.13.3-193-g798ae1b) configuration:
  Snapshot of: 798ae1b
  Platform:
  osname=darwin, osvers=10.4.0, archname=darwin-2level
  uname='darwin pint.local 10.4.0 darwin kernel version 10.4.0: fri apr 23 18:28:53 pdt 2010; root:xnu-1504.7.4~1release_i386 i386 '
  config_args='-de -Dusedevel'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=undef, usemultiplicity=undef
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=undef, use64bitall=undef, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler:
  cc='cc', ccflags ='-fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
  optimize='-O3',
  cppflags='-no-cpp-precomp -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='4.2.1 (Apple Inc. build 5664)', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries:
  ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags =' -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /usr/lib
  libs=-ldbm -ldl -lm -lutil -lc
  perllibs=-ldl -lm -lutil -lc
  libc=/usr/lib/libc.dylib, so=dylib, useshrplib=false, libperl=libperl.a
  gnulibc_version=''
  Dynamic Linking:
  dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
  cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup -L/usr/local/lib -fstack-protector'

Locally applied patches:
 


@INC for perl 5.13.3:
  /usr/local/lib/perl5/site_perl/5.13.3/darwin-2level
  /usr/local/lib/perl5/site_perl/5.13.3
  /usr/local/lib/perl5/5.13.3/darwin-2level
  /usr/local/lib/perl5/5.13.3
  /usr/local/lib/perl5/site_perl
  .


Environment for perl 5.13.3:
  DYLD_LIBRARY_PATH (unset)
  HOME=/Users/sprout
  LANG=en_US.UTF-8
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/usr/local/bin
  PERL_BADLANG (unset)
  SHELL=/bin/bash

@p5pRT
Author

p5pRT commented Aug 16, 2010

From @ikegami

On Sun, Aug 15, 2010 at 5:11 PM, Father Chrysostomos
<perlbug-followup@perl.org> wrote:

# New Ticket Created by  Father Chrysostomos
# Please include the string:  [perl #77242]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=77242 >

#!perl -l
$fn = "\xe2\x80\x99";
`touch \Q$fn`;
print -e $fn; # prints 1
print -e substr "$fn\x{100}", -1; # prints nothing

This can happen easily if the file name comes from a URI (URIs always represent byte sequences) and the URI happened to come from a UTF-8 web page.

Bad demo. The arguments for substr are wrong. The following
demonstrates the bug:

#!perl -l
$fn = "\xe2\x80\x99";
`touch \Q$fn`;
print -e $fn; # prints 1
print -e substr "$fn\x{100}", 0, -1; # prints nothing

So does this simpler code:

#!perl -l
$fn = "\xe2\x80\x99";
`touch \Q$fn`;
print -e $fn; # prints 1
utf8::upgrade($fn);
print -e $fn; # prints nothing

@p5pRT
Author

p5pRT commented Aug 16, 2010

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Author

p5pRT commented Oct 5, 2017

From dbook@cpan.org

Created by dbook@cpan.org

File test operators like -e and -f appear to use the
internal representation of the file path passed to them, or are
otherwise broken by upgrading the string when it contains
non-ASCII bytes. It seems unreasonable to expect users to
downgrade every file path passed to these operators just in case
it was accidentally upgraded; the expected behavior would be to
use the Perl-level representation of the string. Reproduction
case below, or at https://perlbot.pl/raw/l2vc6w .

use strict;
use warnings;

use Encode 'encode';
use File::Spec::Functions 'catfile';
use File::Temp;
use Test::More;

my $filename = "t\x{eb}st";

my $dir = File::Temp->newdir;

my $filepath = catfile $dir, encode('UTF-8', $filename);
open my $fh, '>', $filepath or die "Failed to create file $filepath: $!";
close $fh;

ok -e $filepath, "File $filepath exists";
utf8::upgrade $filepath;
ok -e $filepath, "File $filepath still exists"; # this fails

done_testing;

__END__
ok 1 - File /tmp/GnS7A2ORvG/tëst exists
not ok 2 - File /tmp/GnS7A2ORvG/tëst still exists
# Failed test 'File /tmp/GnS7A2ORvG/tëst still exists'
# at test_filename.pl line 19.
1..2
# Looks like you failed 1 test of 2.
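
Until the operators themselves change, the downgrade workaround mentioned above can be wrapped in a small helper. The following is only a sketch under stated assumptions: `path_exists` is a hypothetical name, not part of Perl or any module, and it assumes the path is meant to be a byte string (as on Unix-like systems).

```perl
use strict;
use warnings;
use File::Temp ();

# Hypothetical helper: test for existence using the downgraded (byte) form
# of the path, so an accidental utf8::upgrade does not change the result.
sub path_exists {
    my ($path) = @_;              # work on a copy; the caller's string is untouched
    # A true second argument makes utf8::downgrade return false instead of
    # dying when the string holds code points above 0xFF.
    utf8::downgrade($path, 1)
        or return;                # not representable as bytes, so treat as absent
    return -e $path;
}

my $dir  = File::Temp->newdir;
my $path = "$dir/t\xc3\xabst";    # UTF-8 bytes of "tëst"
open my $fh, '>', $path or die "Cannot create $path: $!";
close $fh;

my $upgraded = $path;
utf8::upgrade($upgraded);         # same characters, different internal storage

print path_exists($path)     ? "found\n" : "missing\n";
print path_exists($upgraded) ? "found\n" : "missing\n";
```

Working on a copy keeps the caller's string unchanged, and both calls should report the file regardless of how the string happens to be stored internally.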

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.26.0:

Configured by grinnz at Tue May 30 17:37:21 EDT 2017.

Summary of my perl5 (revision 5 version 26 subversion 0) configuration:
   
  Platform:
    osname=linux
    osvers=4.8.13-100.fc23.x86_64
    archname=x86_64-linux
    uname='linux home.grinnz.com 4.8.13-100.fc23.x86_64 #1 smp fri dec 9 14:51:40 utc 2016 x86_64 x86_64 x86_64 gnulinux '
    config_args='-Dprefix=/home/grinnz/.plenv/versions/5.26.0 -de -Dusedevel -A'eval:scriptdir=/home/grinnz/.plenv/versions/5.26.0/bin''
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=undef
    usemultiplicity=undef
    use64bitint=define
    use64bitall=define
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
    bincompat5005=undef
  Compiler:
    cc='cc'
    ccflags ='-fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2'
    optimize='-O2'
    cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='5.3.1 20160406 (Red Hat 5.3.1-6)'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='cc'
    ldflags =' -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib /lib/../lib64 /usr/lib/../lib64 /lib /lib64 /usr/lib64 /usr/local/lib64
    libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.22.so
    so=so
    useshrplib=false
    libperl=libperl.a
    gnulibc_version='2.22'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E'
    cccdlflags='-fPIC'
    lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector-strong'

Locally applied patches:
    Devel::PatchPerl 1.38


@INC for perl 5.26.0:
    /home/grinnz/.plenv/versions/5.26.0/lib/perl5/site_perl/5.26.0/x86_64-linux
    /home/grinnz/.plenv/versions/5.26.0/lib/perl5/site_perl/5.26.0
    /home/grinnz/.plenv/versions/5.26.0/lib/perl5/5.26.0/x86_64-linux
    /home/grinnz/.plenv/versions/5.26.0/lib/perl5/5.26.0


Environment for perl 5.26.0:
    HOME=/home/grinnz
    LANG=en_US.utf8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/grinnz/.plenv/versions/5.26.0/bin:/home/grinnz/.plenv/libexec:/home/grinnz/.plenv/plugins/perl-build/bin:/home/grinnz/.plenv/plugins/plenv-contrib/bin:/home/grinnz/.plenv/shims:/home/grinnz/.plenv/bin:/usr/local/cuda/bin:/home/grinnz/.plenv/bin:/usr/local/cuda/bin:/usr/local/bin:/usr/bin
    PERL_BADLANG (unset)
    SHELL=/usr/bin/fish

@p5pRT
Author

p5pRT commented Oct 5, 2017

From @Grinnz

This is a duplicate of https://rt.perl.org/Public/Bug/Display.html?id=77242.
-Dan

@p5pRT
Author

p5pRT commented Oct 5, 2017

The RT System itself - Status changed from 'new' to 'open'

@toddr toddr removed the khw label Oct 25, 2019
@p5pRT p5pRT added the Unicode and System Calls Bad interactions of syscalls and UTF-8 label Nov 15, 2019
nschloe pushed a commit to live-clones/lintian that referenced this issue Apr 15, 2020
…ile::Path. (See: #956233, #956723)

This provides relief from runtime errors in Lintian, but does not
solve the bugs. It merely makes Lintian useable again.

The offending packages sphinx and supysonic no longer abort with
runtime errors.

Due to a bug in Perl, strings must be "downgraded" before system calls
such as stat or open. It is the proper fix [1][2], and should happen
in Perl. We simply do so here as triage.

[1] Perl/perl5#10550
[2] Perl/perl5#9674

More comprehensive fixes for both bugs are in the works.
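
The triage described in this commit message, downgrading a string before system calls such as stat or open, might look roughly like the sketch below. This is not Lintian's actual code; `byte_path` is a hypothetical helper name chosen for illustration.

```perl
use strict;
use warnings;
use Carp qw(croak);
use File::Temp ();

# Hypothetical helper (not from Lintian or Perl core): return a copy of the
# path that is guaranteed to be stored as plain bytes.
sub byte_path {
    my ($path) = @_;
    utf8::downgrade($path, 1)
        or croak 'Path contains characters above 0xFF; encode it first';
    return $path;
}

my $dir  = File::Temp->newdir;
my $path = "$dir/caf\xc3\xa9.txt";          # UTF-8 bytes of "café.txt"
open my $out, '>', $path or die "Cannot create $path: $!";
print {$out} 'hello';
close $out or die "Cannot close $path: $!";

utf8::upgrade($path);                       # simulate an accidental upgrade

# Pass the downgraded copy to every call that reaches the file system.
my @st = stat(byte_path($path)) or die "stat failed: $!";
open my $in, '<', byte_path($path) or die "open failed: $!";
my $content = <$in>;
close $in;

print "size: $st[7], content: $content\n";
```

Croaking on wide characters, rather than silently falling back to the internal buffer, forces the caller to choose an encoding explicitly.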
@xenu xenu removed the affects-5.13 label Nov 19, 2021
@khwilliamson khwilliamson self-assigned this Apr 6, 2022
@khwilliamson
Contributor

I believe this ticket is unfixable.

Take
$fn = "\xe2\x80\x99"; `touch \Q$fn`; print -e $fn; # prints 1

These bytes comprise U+2019. If one says
utf8::decode($fn)
the result is
PV = 0x561180791538 "\xE2\x80\x99"\0 [UTF8 "\x{2019}"]
and
print -e $fn; # prints 1

I believe that is the correct behavior.
But the ticket didn't do a decode. Instead it did utf8::upgrade.
That yields
PV = 0x561fd8262438 "\xC3\xA2\xC2\x80\xC2\x99"\0 [UTF8 "\x{e2}\x{80}\x{99}"]

And print -e $fn; # prints 0

What the tickets are effectively asking for is for two different UTF8 strings to evaluate to the same thing. I don't think that is advisable. One needs to choose one or the other interpretation, and I believe the one we already have chosen is the better option.
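
The difference between the two operations can be observed without Devel::Peek. The sketch below is illustrative only (`describe` is a hypothetical helper); it reports the logical character count next to the size of the internal buffer for the original, decoded, and upgraded strings.

```perl
use strict;
use warnings;

# Describe a string at both levels: logical characters and internal buffer size.
sub describe {
    my ($s) = @_;
    my $internal = do { use bytes; length $s };   # bytes in the internal buffer
    return sprintf 'chars=%d internal_bytes=%d utf8_flag=%d',
        length($s), $internal, utf8::is_utf8($s) ? 1 : 0;
}

my $orig = "\xe2\x80\x99";

my $decoded = $orig;
utf8::decode($decoded);      # reinterpret the bytes as UTF-8: one character, U+2019

my $upgraded = $orig;
utf8::upgrade($upgraded);    # same three characters, stored internally as UTF-8

print 'original: ', describe($orig),     "\n";   # chars=3 internal_bytes=3 utf8_flag=0
print 'decoded:  ', describe($decoded),  "\n";   # chars=1 internal_bytes=3 utf8_flag=1
print 'upgraded: ', describe($upgraded), "\n";   # chars=3 internal_bytes=6 utf8_flag=1
```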

@Grinnz
Contributor

Grinnz commented Apr 6, 2022

It's not asking for that. It is asking for two strings with the same contents to evaluate to the same thing. This is an instance of the unicode bug fixed by https://metacpan.org/pod/Sys::Binmode.

@ikegami
Contributor

ikegami commented Apr 6, 2022

Like Grinnz said, it's an instance of The Unicode Bug. Every builtin that deals with files still suffers from this bug.

@Grinnz
Contributor

Grinnz commented Apr 6, 2022

To be specific to your examples:

Take $fn = "\xe2\x80\x99"; `touch \Q$fn`; print -e $fn; # prints 1

These bytes comprise U+2019. If one says utf8::decode($fn) the result is PV = 0x561180791538 "\xE2\x80\x99"\0 [UTF8 "\x{2019}"] and print -e $fn; # prints 1

This is one example of the bug. The content of that string is "\x{2019}", while the name of the file is "\xE2\x80\x99". Correct functionality would behave like print "\x{2019}": warn about wide characters (since a filename cannot contain that wide character) and fall back to the internally stored bytes. Instead it ignores the logical contents of the string and uses the internal bytes directly, which specifically results in wrong behavior with no warnings when an upgraded string contains characters that are neither ASCII nor wide (\x80-\xFF).

I believe that is the correct behavior. But the ticket didn't do a decode. Instead it did utf8::upgrade. That yields PV = 0x561fd8262438 "\xC3\xA2\xC2\x80\xC2\x99"\0 [UTF8 "\x{e2}\x{80}\x{99}"]

And print -e $fn; # prints 0

This is the other direction of this bug. This string is logically identical to the original string, as it was only subjected to an upgrade operation. If you printed it, the output would be the same bytes as with the original string. It has a length of 3, just like the original string. And yet, used as a filename, it silently has different behavior.
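
The claim that the upgraded string is logically identical can be checked directly. A minimal sketch, using a hypothetical `output_bytes` helper that prints to an in-memory filehandle so the emitted bytes can be compared:

```perl
use strict;
use warnings;

# Hypothetical helper: capture the bytes that print would emit for a string.
sub output_bytes {
    my ($s) = @_;
    my $buf = '';
    open my $fh, '>:raw', \$buf or die "Cannot open in-memory handle: $!";
    print {$fh} $s;
    close $fh;
    return $buf;
}

my $orig     = "\xe2\x80\x99";
my $upgraded = $orig;
utf8::upgrade($upgraded);

print 'equal strings: ', ($orig eq $upgraded ? 'yes' : 'no'), "\n";
print 'length:        ', length($upgraded), "\n";
print 'same output:   ',
    (output_bytes($orig) eq output_bytes($upgraded) ? 'yes' : 'no'), "\n";
```

Every string-level operation here treats the two values as the same; only the builtins that hand the internal buffer to the operating system tell them apart.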

@khwilliamson
Contributor

OK, I understand your points better now.

Do no file systems allow UTF-8 names? What about Windows? Are their filenames not UTF16?

@Grinnz
Contributor

Grinnz commented Apr 6, 2022

Different OS/filesystem behavior is one of the biggest reasons it's hard to fix this. Filenames on Unix-like systems are bytes, always: encoding is not enforced, just assumed via locale environment, except on macOS, which also normalizes the UTF-8 in the filesystem. On Windows they are stored in UTF-16 (I believe), but I am not familiar with how Perl's functions interact with that.
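
Because Unix-like filenames are bytes, a character-string name has to be encoded explicitly before it reaches the file system. The sketch below assumes a Unix-like system where UTF-8 is the conventional filename encoding; `name_to_bytes` is a hypothetical helper, and Windows is not covered.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp ();

# Hypothetical helper: encode a character-string name into the bytes that
# will be handed to the file system. UTF-8 is an assumption, not a rule.
sub name_to_bytes {
    my ($name) = @_;
    return encode('UTF-8', $name);
}

my $dir   = File::Temp->newdir;
my $name  = "t\x{eb}st-\x{2019}";            # characters, including a wide one
my $bytes = name_to_bytes($name);
my $path  = "$dir/$bytes";                   # a byte string from here on

open my $fh, '>', $path or die "Cannot create file: $!";
close $fh;

printf "characters: %d, bytes: %d\n", length($name), length($bytes);
print -e $path ? "found\n" : "missing\n";
```

Encoding at the boundary keeps the rest of the program working with characters while the file system only ever sees bytes.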

@Grinnz
Contributor

Grinnz commented Apr 6, 2022

The main functional issue is the inconsistency in behavior, not the encoding; but the differences in filesystem behavior make the encoding an unfortunately relevant issue to any attempts at fixing this.
