Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode symbols damaged in $File::Find::name #10369

Open
p5pRT opened this issue May 9, 2010 · 3 comments
Open

Unicode symbols damaged in $File::Find::name #10369

p5pRT opened this issue May 9, 2010 · 3 comments

Comments

@p5pRT
Copy link

p5pRT commented May 9, 2010

Migrated from rt.perl.org#75000 (status was 'open')

Searchable as RT75000$

@p5pRT
Copy link
Author

p5pRT commented May 9, 2010

From perlbug@mvsnet.org.ua

Created by perlbug@mvsnet.org.ua

when executed following code
find(sub {
  return if -d $File​::Find​::name;
  return if ! /$suffixes$/;
  my $name=$File​::Find​::name;
  print 'File​: ';
  print $_;
  print ' Path​: ';
  print $name;
  }, $directory);
with folder containing files named with non-latin characters the output of '$name' contains damaged unicode characters.
If $directory also contains non-latin characters only file names are damaged ($directory part is correct)

Perl Info

Flags:
    category=core
    severity=critical

Site configuration information for perl 5.10.1:

Configured by Debian Project at Sun Apr 11 22:31:36 UTC 2010.

Summary of my perl5 (revision 5 version 10 subversion 1) configuration:
   
  Platform:
    osname=linux, osvers=2.6.32.11-dsa-ia32, archname=i486-linux-gnu-thread-multi
    uname='linux murphy 2.6.32.11-dsa-ia32 #1 smp fri apr 2 10:32:00 cest 2010 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.1 -Dsitearch=/usr/local/lib/perl/5.10.1 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.1 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.4.3', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.10.2.so, so=so, useshrplib=true, libperl=libperl.so.5.10.1
    gnulibc_version='2.10.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib -fstack-protector'

Locally applied patches:
    DEBPKG:debian/arm_thread_stress_timeout - http://bugs.debian.org/501970 Raise the timeout of ext/threads/shared/t/stress.t to accommodate slower build hosts
    DEBPKG:debian/cpan_config_path - Set location of CPAN::Config to /etc/perl as /usr may not be writable.
    DEBPKG:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN.
    DEBPKG:debian/db_file_ver - http://bugs.debian.org/340047 Remove overly restrictive DB_File version check.
    DEBPKG:debian/doc_info - Replace generic man(1) instructions with Debian-specific information.
    DEBPKG:debian/enc2xs_inc - http://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @INC directories.
    DEBPKG:debian/errno_ver - http://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes.
    DEBPKG:debian/extutils_hacks - Various debian-specific ExtUtils changes
    DEBPKG:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets.
    DEBPKG:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor.
    DEBPKG:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy.
    DEBPKG:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable.
    DEBPKG:debian/m68k_thread_stress - http://bugs.debian.org/495826 Disable some threads tests on m68k for now due to missing TLS.
    DEBPKG:debian/mod_paths - Tweak @INC ordering for Debian
    DEBPKG:debian/module_build_man_extensions - http://bugs.debian.org/479460 Adjust Module::Build manual page extensions for the Debian Perl policy
    DEBPKG:debian/perl_synopsis - http://bugs.debian.org/278323 Rearrange perl.pod
    DEBPKG:debian/prune_libs - http://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need.
    DEBPKG:debian/use_gdbm - Explicitly link against -lgdbm_compat in ODBM_File/NDBM_File. 
    DEBPKG:fixes/assorted_docs - http://bugs.debian.org/443733 [384f06a] Math::BigInt::CalcEmu documentation grammar fix
    DEBPKG:fixes/net_smtp_docs - http://bugs.debian.org/100195 [rt.cpan.org #36038] Document the Net::SMTP 'Port' option
    DEBPKG:fixes/processPL - http://bugs.debian.org/357264 [rt.cpan.org #17224] Always use PERLRUNINST when building perl modules.
    DEBPKG:debian/perlivp - http://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local
    DEBPKG:fixes/pod2man-index-backslash - http://bugs.debian.org/521256 Escape backslashes in .IX entries
    DEBPKG:debian/disable-zlib-bundling - Disable zlib bundling in Compress::Raw::Zlib
    DEBPKG:fixes/kfreebsd_cppsymbols - http://bugs.debian.org/533098 [3b910a0] Add gcc predefined macros to $Config{cppsymbols} on GNU/kFreeBSD.
    DEBPKG:debian/cpanplus_definstalldirs - http://bugs.debian.org/533707 Configure CPANPLUS to use the site directories by default.
    DEBPKG:debian/cpanplus_config_path - Save local versions of CPANPLUS::Config::System into /etc/perl.
    DEBPKG:fixes/kfreebsd-filecopy-pipes - http://bugs.debian.org/537555 [16f708c] Fix File::Copy::copy with pipes on GNU/kFreeBSD
    DEBPKG:fixes/anon-tmpfile-dir - http://bugs.debian.org/528544 [perl #66452] Honor TMPDIR when open()ing an anonymous temporary file
    DEBPKG:fixes/abstract-sockets - http://bugs.debian.org/329291 [89904c0] Add support for Abstract namespace sockets.
    DEBPKG:fixes/hurd_cppsymbols - http://bugs.debian.org/544307 [eeb92b7] Add gcc predefined macros to $Config{cppsymbols} on GNU/Hurd.
    DEBPKG:fixes/autodie-flock - http://bugs.debian.org/543731 Allow for flock returning EAGAIN instead of EWOULDBLOCK on linux/parisc
    DEBPKG:fixes/archive-tar-instance-error - http://bugs.debian.org/539355 [rt.cpan.org #48879] Separate Archive::Tar instance error strings from each other
    DEBPKG:fixes/positive-gpos - http://bugs.debian.org/545234 [perl #69056] [c584a96] Fix \\G crash on first match
    DEBPKG:debian/devel-ppport-ia64-optim - http://bugs.debian.org/548943 Work around an ICE on ia64
    DEBPKG:fixes/trie-logic-match - http://bugs.debian.org/552291 [perl #69973] [0abd0d7] Fix a DoS in Unicode processing [CVE-2009-3626]
    DEBPKG:fixes/hppa-thread-eagain - http://bugs.debian.org/554218 make the threads-shared test suite more robust, fixing failures on hppa
    DEBPKG:fixes/crash-on-undefined-destroy - http://bugs.debian.org/564074 [perl #71952] [1f15e67] Fix a NULL pointer dereference when looking for a DESTROY method
    DEBPKG:fixes/tainted-errno - http://bugs.debian.org/574129 [perl #61976] [be1cf43] fix an errno stringification bug in taint mode
    DEBPKG:patchlevel - http://bugs.debian.org/567489 List packaged patches for 5.10.1-12 in patchlevel.h


@INC for perl 5.10.1:
    /etc/perl
    /usr/local/lib/perl/5.10.1
    /usr/local/share/perl/5.10.1
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .


Environment for perl 5.10.1:
    HOME=/root
    LANG=ru_RU.UTF-8
    LANGUAGE=ru_RU:ru
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/opt/qtsdk-2010.02/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Copy link
Author

p5pRT commented May 10, 2010

From @iabyn

On Sun, May 09, 2010 at 11​:57​:36AM -0700, Vladimir Morozov wrote​:

when executed following code
find(sub {
return if -d $File​::Find​::name;
return if ! /$suffixes$/;
my $name=$File​::Find​::name;
print 'File​: ';
print $_;
print ' Path​: ';
print $name;
}, $directory);
with folder containing files named with non-latin characters the output of '$name' contains damaged unicode characters.
If $directory also contains non-latin characters only file names are damaged ($directory part is correct)

This is a general issue with filenames, and not just restricted to
File​::Find. For example the following shows that the returned filename
string isn't UTF-8 encoded​:

  my $f = "file\x{100}";
  open my $fh, '>', $f or die "open​: $!\n";
  close $fh;

  my ($newf) = <file*>;
  use Devel​::Peek;
  Dump $f;
  Dump $newf;

A workaround (if you know that the filenames are UTF8 encoded) is to
UTF-8 decode the returned filename before using it, e.g.​:

  my $name = $_;
  utf8​::decode($name);

I notice that perltodo.pod has this entry​:

  =head2 Unicode and glob()

  Currently glob patterns and filenames returned from File​::Glob​::glob()
  are always byte strings. See L</"Virtualize operating system access">.

and perlrun.pod has this entry​:

  =item B<-C [I<number/list>]>
  ...
  =for todo
  perltodo mentions Unicode in %ENV and filenames. I guess that these will be
  options e and f (or F).

--
This is a great day for France!
  -- Nixon at Charles De Gaulle's funeral

@p5pRT
Copy link
Author

p5pRT commented May 10, 2010

The RT System itself - Status changed from 'new' to 'open'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants