Report information
Id: 75000
Status: open
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: morozovvs <perlbug [at]>

Operating System: Linux
PatchStatus: (no value)
Severity: medium
  • core
  • OS-interaction
  • Unicode
Perl Version:
  • 5.10.1
  • 5.12.0
  • 5.13.0
Fixed In: (no value)

CC: perlbug [...]
Subject: Unicode symbols damaged in $File::Find::name
Date: Sun, 9 May 2010 21:55:00 +0300 (EEST)
To: perlbug [...]
From: root [...] (root)
When the following code is executed:

    find(sub {
        return if -d $File::Find::name;
        return if ! /$suffixes$/;
        my $name=$File::Find::name;
        print 'File: ';
        print $_;
        print ' Path: ';
        print $name;
    }, $directory);

on a folder containing files named with non-Latin characters, the output of '$name' contains damaged Unicode characters.
If $directory also contains non-Latin characters, only the file names are damaged (the $directory part is correct).

Summary of my perl5 (revision 5 version 10 subversion 1) configuration:

  Platform:
    osname=linux, osvers=, archname=i486-linux-gnu-thread-multi
    uname='linux murphy #1 smp fri apr 2 10:32:00 cest 2010 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.1 -Dsitearch=/usr/local/lib/perl/5.10.1 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.4.3', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/, so=so, useshrplib=true,
    gnulibc_version='2.10.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib -fstack-protector'

Locally applied patches:
    DEBPKG:debian/arm_thread_stress_timeout - Raise the timeout of ext/threads/shared/t/stress.t to accommodate slower build hosts
    DEBPKG:debian/cpan_config_path - Set location of CPAN::Config to /etc/perl as /usr may not be writable.
    DEBPKG:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN.
    DEBPKG:debian/db_file_ver - Remove overly restrictive DB_File version check.
    DEBPKG:debian/doc_info - Replace generic man(1) instructions with Debian-specific information.
    DEBPKG:debian/enc2xs_inc - Tweak enc2xs to follow symlinks and ignore missing @INC directories.
    DEBPKG:debian/errno_ver - Remove Errno version check due to upgrade problems with long-running processes.
    DEBPKG:debian/extutils_hacks - Various debian-specific ExtUtils changes
    DEBPKG:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets.
    DEBPKG:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor.
    DEBPKG:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy.
    DEBPKG:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable.
    DEBPKG:debian/m68k_thread_stress - Disable some threads tests on m68k for now due to missing TLS.
    DEBPKG:debian/mod_paths - Tweak @INC ordering for Debian
    DEBPKG:debian/module_build_man_extensions - Adjust Module::Build manual page extensions for the Debian Perl policy
    DEBPKG:debian/perl_synopsis - Rearrange perl.pod
    DEBPKG:debian/prune_libs - Prune the list of libraries wanted to what we actually need.
    DEBPKG:debian/use_gdbm - Explicitly link against -lgdbm_compat in ODBM_File/NDBM_File.
    DEBPKG:fixes/assorted_docs - [384f06a] Math::BigInt::CalcEmu documentation grammar fix
    DEBPKG:fixes/net_smtp_docs - [ #36038] Document the Net::SMTP 'Port' option
    DEBPKG:fixes/processPL - [ #17224] Always use PERLRUNINST when building perl modules.
    DEBPKG:debian/perlivp - Make perlivp skip include directories in /usr/local
    DEBPKG:fixes/pod2man-index-backslash - Escape backslashes in .IX entries
    DEBPKG:debian/disable-zlib-bundling - Disable zlib bundling in Compress::Raw::Zlib
    DEBPKG:fixes/kfreebsd_cppsymbols - [3b910a0] Add gcc predefined macros to $Config{cppsymbols} on GNU/kFreeBSD.
    DEBPKG:debian/cpanplus_definstalldirs - Configure CPANPLUS to use the site directories by default.
    DEBPKG:debian/cpanplus_config_path - Save local versions of CPANPLUS::Config::System into /etc/perl.
    DEBPKG:fixes/kfreebsd-filecopy-pipes - [16f708c] Fix File::Copy::copy with pipes on GNU/kFreeBSD
    DEBPKG:fixes/anon-tmpfile-dir - [perl #66452] Honor TMPDIR when open()ing an anonymous temporary file
    DEBPKG:fixes/abstract-sockets - [89904c0] Add support for Abstract namespace sockets.
    DEBPKG:fixes/hurd_cppsymbols - [eeb92b7] Add gcc predefined macros to $Config{cppsymbols} on GNU/Hurd.
    DEBPKG:fixes/autodie-flock - Allow for flock returning EAGAIN instead of EWOULDBLOCK on linux/parisc
    DEBPKG:fixes/archive-tar-instance-error - [ #48879] Separate Archive::Tar instance error strings from each other
    DEBPKG:fixes/positive-gpos - [perl #69056] [c584a96] Fix \\G crash on first match
    DEBPKG:debian/devel-ppport-ia64-optim - Work around an ICE on ia64
    DEBPKG:fixes/trie-logic-match - [perl #69973] [0abd0d7] Fix a DoS in Unicode processing [CVE-2009-3626]
    DEBPKG:fixes/hppa-thread-eagain - make the threads-shared test suite more robust, fixing failures on hppa
    DEBPKG:fixes/crash-on-undefined-destroy - [perl #71952] [1f15e67] Fix a NULL pointer dereference when looking for a DESTROY method
    DEBPKG:fixes/tainted-errno - [perl #61976] [be1cf43] fix an errno stringification bug in taint mode
    DEBPKG:patchlevel - List packaged patches for 5.10.1-12 in patchlevel.h

---
@INC for perl 5.10.1:
    /etc/perl
    /usr/local/lib/perl/5.10.1
    /usr/local/share/perl/5.10.1
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .

---
Environment for perl 5.10.1:
    HOME=/root
    LANG=ru_RU.UTF-8
    LANGUAGE=ru_RU:ru
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/opt/qtsdk-2010.02/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

Subject: Re: [perl #75000] Unicode symbols damaged in $File::Find::name
Date: Mon, 10 May 2010 18:04:27 +0100
To: perl5-porters [...]
From: Dave Mitchell <davem [...]>
On Sun, May 09, 2010 at 11:57:36AM -0700, Vladimir Morozov wrote:
> when executed following code
> find(sub {
> return if -d $File::Find::name;
> return if ! /$suffixes$/;
> my $name=$File::Find::name;
> print 'File: ';
> print $_;
> print ' Path: ';
> print $name;
> }, $directory);
> with folder containing files named with non-latin characters the output of '$name' contains damaged unicode characters.
> If $directory also contains non-latin characters only file names are damaged ($directory part is correct)

This is a general issue with filenames, and not just restricted to
File::Find. For example the following shows that the returned filename
string isn't UTF-8 encoded:

    my $f = "file\x{100}";
    open my $fh, '>', $f or die "open: $!\n";
    close $fh;
    my ($newf) = <file*>;
    use Devel::Peek;
    Dump $f;
    Dump $newf;

A workaround (if you know that the filenames are UTF8 encoded) is to
UTF-8 decode the returned filename before using it, e.g.:

    my $name = $_;
    utf8::decode($name);

I notice that perltodo.pod has this entry:

    =head2 Unicode and glob()

    Currently glob patterns and filenames returned from File::Glob::glob()
    are always byte strings. See L</"Virtualize operating system access">.

and perlrun.pod has this entry:

    =item B<-C [I<number/list>]>
    ...
    =for todo
    perltodo mentions Unicode in %ENV and filenames. I guess that these will
    be options e and f (or F).

--
This is a great day for France!
    -- Nixon at Charles De Gaulle's funeral
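
For illustration, the utf8::decode workaround could be applied inside a File::Find callback roughly as follows. This is only a sketch, not a fix for the underlying issue. It assumes the filesystem stores names as UTF-8 bytes (as on the reporter's ru_RU.UTF-8 system). The temporary directory, the sample file name, and the @found array are made up for the example and are not part of the original report.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Assumption: file names on disk are UTF-8 encoded bytes.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical test fixture: a scratch directory with one non-Latin file name.
my $directory = tempdir(CLEANUP => 1);
my $sample    = "\x{442}\x{435}\x{441}\x{442}.txt";   # Cyrillic "test" + .txt

my $bytes = $sample;
utf8::encode($bytes);            # character string -> UTF-8 octets
open my $fh, '>', "$directory/$bytes" or die "open: $!\n";
close $fh;

my @found;
find(sub {
    return if -d $_;             # skip directories (File::Find has chdir'ed here)

    # $File::Find::name arrives as a byte string. Decode a copy so that
    # File::Find's own variables are left untouched.
    my $name = $File::Find::name;
    utf8::decode($name);         # leaves $name unchanged if it is not valid UTF-8

    push @found, $name;
    print "Path: $name\n";
}, $directory);
```

Decoding a copy rather than $File::Find::name itself keeps the byte form available for later file operations such as open or stat, which still expect the name exactly as the operating system returned it.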

