Silent encoding of filenames with UTF8 flag set #15305

Open
p5pRT opened this issue May 6, 2016 · 3 comments
Labels
Unicode and System Calls Bad interactions of syscalls and UTF-8

Comments

@p5pRT

p5pRT commented May 6, 2016

Migrated from rt.perl.org#128083 (status was 'open')

Searchable as RT128083$

@p5pRT

p5pRT commented May 6, 2016

From @hakonhagland

According to perldoc "perlunicode", section "When Unicode Does Not Happen":

https://metacpan.org/pod/distribution/perl/pod/perlunicode.pod#When-Unicode-Does-Not-Happen

Quote: "There are still many places where Unicode (in some encoding or
another) could be given as arguments or received as results, or both in
Perl, but it is not"

A set of interfaces is then listed, including system() and mkdir(), to which
the above statement applies.

Still, it is my experience that the above statement is not strictly
correct, in the sense that Perl will encode (as UTF8) any
Perl string with the UTF8 flag set that is input to these interfaces. For
example, consider system() :

use strict;
use utf8;
use warnings;
use Encode ();

# This sets the UTF8 flag on $str due to the "use utf8" pragma and makes
# $str consist of one char with ordinal value E5
my $str = 'å';

# This clears the UTF8 flag on $str_utf8, makes it a binary string
# of two bytes: C3 A5
my $str_utf8 = Encode::encode_utf8( $str );

# Argument to system(), UTF8 flag is set for $arg due to interpolation
# of $str
my $arg = "echo -n '$str' | hexdump -C";

# system() silently encodes $arg as UTF8
system $arg;

# Argument to system(), UTF8 flag is not set for $arg2
my $arg2 = "echo -n '$str_utf8' | hexdump -C";

# system() does nothing with $str_utf8 (since it has no UTF8 flag set)
system $arg2;

The output of the above script is:

00000000 c3 a5 |..|
00000002
00000000 c3 a5 |..|
00000002

This confirms that $arg was silently encoded as UTF8 before being passed to
/bin/sh. My concern is that this type of encoding seems to be undocumented
(at least I have not found any reference to it in the docs), and I wonder
what the official recommendation would be:

1. Always explicitly encode arguments passed to system(), mkdir(), chdir(),
   ..., or
2. It is not necessary to encode arguments; one can always trust that the
   arguments will be encoded correctly by the given function.

If 2) is the recommendation, then perhaps it should be documented somewhere
(assuming I did not miss that part of the docs). The docs should mention
that these interfaces will encode input arguments.
However, note: I have come across one CPAN module that *requires* the user
to encode the input argument, File::Find::Rule (and I therefore assume that
other such modules probably exist). So if 2) is recommended, there would be
some "inconsistency", in the sense that mkdir $name would not require $name
to be encoded, but File::Find::Rule->new->name( $name ) would require the
user to first encode $name.
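Option 1 above could look roughly like the following. This is only a minimal sketch, assuming a Unix-like filesystem that stores file names as bytes and accepts UTF-8 encoded names; the variable names ($base, $name_bytes, and so on) are made up for illustration, and the sketch does not settle which of the two options is the official recommendation.

```perl
use strict;
use warnings;
use Encode     qw(encode_utf8 decode_utf8);
use File::Temp qw(tempdir);

# Scratch directory so the sketch does not touch real data.
my $base = tempdir( CLEANUP => 1 );

# One character, U+00E5 (the same letter as in the script above).
my $name = "\x{e5}";

# Choose the on-disk encoding explicitly instead of relying on the
# string's internal representation (assumption: UTF-8 is wanted).
my $name_bytes = encode_utf8($name);
my $path       = "$base/$name_bytes";
mkdir $path or die "mkdir failed: $!";

# readdir() returns bytes as well; decode them before treating them as text.
opendir my $dh, $base or die "opendir failed: $!";
my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
closedir $dh;
my @names = map { decode_utf8($_) } @entries;

printf "bytes passed to mkdir: %s\n", unpack 'H*', $name_bytes;
printf "entries found: %d\n", scalar @names;
```

With explicit encoding, the bytes handed to mkdir() are c3 a5 whether or not $name happens to carry the UTF8 flag.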

Best regards,
Håkon Hægland

@p5pRT

p5pRT commented Jun 20, 2016

From @iabyn

On Fri, May 06, 2016 at 08:27:18AM -0700, Håkon Hægland wrote:

> According to perldoc "perlunicode", section "When Unicode Does Not Happen":
>
> https://metacpan.org/pod/distribution/perl/pod/perlunicode.pod#When-Unicode-Does-Not-Happen
>
> Quote: "There are still many places where Unicode (in some encoding or
> another) could be given as arguments or received as results, or both in
> Perl, but it is not"
>
> Then a set of interfaces are listed, including system() and mkdir(), where
> the above statement applies.
>
> Still, it is my experience that the above statement is not strictly
> correct, in the sense that Perl will encode (as UTF8) any
> Perl string with the UTF8 flag set that is input to these interfaces. For
> example, consider system() :

[snip]

> I have not found
> any reference to it in the docs), and I wonder what the official
> recommendation would be:
>
> 1. Always encode explicitly arguments passed to system(), mkdir(), chdir(),
> ..., or
> 2. It is not necessary to encode arguments; one can always trust that the
> arguments will be encoded correctly by
> the given function.

Perl's system() etc. do not do any form of encoding. They just pass the
physical bytes which make up the string directly to the underlying C
library function as-is, without considering whether the scalar's
UTF8 flag is on or not. For example, this:

  $s = "\x80";
  #utf8::upgrade($s);
  system "echo '$s' | hexdump -C";

outputs:

  00000000 80 0a

while uncommenting the utf8::upgrade gives:

  00000000 c2 80 0a
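The same effect can be observed without going through a shell. The following is a rough sketch, not part of perl itself: internal_bytes_hex() is a hypothetical helper that dumps the bytes Perl currently holds for a string, and Encode::encode_utf8() is used to show that explicit encoding gives the same bytes regardless of the flag.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: hex dump of the bytes Perl holds internally for a
# string, i.e. what a byte-oriented interface such as system() is handed.
sub internal_bytes_hex {
    my ($s) = @_;
    my $copy = $s;
    # For a UTF8-flagged string the internal buffer is its UTF-8 encoding;
    # utf8::encode() turns that buffer into a plain byte string.
    utf8::encode($copy) if utf8::is_utf8($copy);
    return join ' ', map { sprintf '%02x', ord($_) } split //, $copy;
}

my $s = "\x80";
my $native = internal_bytes_hex($s);            # flag off

utf8::upgrade($s);                              # same character, flag on
my $upgraded = internal_bytes_hex($s);

# Explicit encoding fixes the bytes independently of the flag.
my $explicit = internal_bytes_hex( Encode::encode_utf8($s) );

print "native:   $native\n";      # 80
print "upgraded: $upgraded\n";    # c2 80
print "explicit: $explicit\n";    # c2 80
```

The string compares equal to "\x80" before and after utf8::upgrade(); only its internal representation, and therefore what the external program sees, differs.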

--
Red sky at night - gerroff my land!
Red sky at morning - gerroff my land!
  -- old farmers' sayings #14

@p5pRT

p5pRT commented Jun 20, 2016

The RT System itself - Status changed from 'new' to 'open'

@toddr toddr removed the khw label Oct 25, 2019
@p5pRT p5pRT added the Unicode and System Calls Bad interactions of syscalls and UTF-8 label Nov 15, 2019
@xenu xenu removed the Severity Low label Dec 29, 2021