readdir() return value should be documented as always downgraded #13183

Open
p5pRT opened this issue Aug 20, 2013 · 22 comments

Comments

p5pRT commented Aug 20, 2013

Migrated from rt.perl.org#119395 (status was 'open')

Searchable as RT119395$

p5pRT commented Aug 20, 2013

From victor@vsespb.ru

readdir() return value should be documented as always downgraded. Otherwise
there is an inconsistency in the file-function workflow.

Assumptions:
1. readdir returns binary strings.
2. Binary data can be (seemingly randomly) upgraded or downgraded by 3rd-party
code (it's hard for the programmer to control this).
3. Functions working with filenames (listed at
http://perldoc.perl.org/perlunicode.html#When-Unicode-Does-Not-Happen )
simply ignore the UTF-8 flag.
4. The programmer might want to work with filenames as binary strings (not
character strings) because the filesystem encoding is unknown or hard to detect.

Example:

    opendir(my $dh, '.') or die "opendir failed: $!";
    for (readdir($dh)) {
        # Simulate third-party code changing the internal representation.
        $ARGV[0] ? utf8::upgrade($_) : utf8::downgrade($_);
        die $_ unless -e $_;
    }

The code above fails if the binary strings were upgraded and the current
directory contains filenames with non-ASCII (8-bit) characters.

The line "$ARGV[0] ? utf8::upgrade($_) : utf8::downgrade($_);" stands for the
fact that a binary string may be unpredictably upgraded or downgraded (perhaps
after concatenation with an ASCII string that has the UTF-8 flag set, or
after serialization/deserialization).

Solution: the programmer should explicitly call utf8::downgrade($_) before
the "-e" test:

    opendir(my $dh, '.') or die "opendir failed: $!";
    for (readdir($dh)) {
        $ARGV[0] ? utf8::upgrade($_) : utf8::downgrade($_);
        utf8::downgrade($_);    # restore the byte representation before the file test
        die $_ unless -e $_;
    }

But that is a correct solution only if we are sure that readdir returns
downgraded byte strings.

The same is probably true for readlink and @ARGV.
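
A minimal sketch of what "upgraded" and "downgraded" mean for a byte string. The variable names and the sample bytes "\xC2\xB5" are assumptions for illustration, not taken from a real directory listing:

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length() without enabling byte semantics

# Assumed sample: two bytes that readdir() might return for a file name.
my $name = "\xC2\xB5";

my $upgraded = $name;
utf8::upgrade($upgraded);    # changes only the internal representation

# Logically both strings are the same two characters...
my $same_string = ($upgraded eq $name);            # true
my $char_length = length($upgraded);               # 2

# ...but the internal buffer, which flag-ignoring functions see, differs.
my $flag_on         = utf8::is_utf8($upgraded);    # true
my $internal_length = bytes::length($upgraded);    # 4

printf "eq=%d chars=%d flag=%d internal bytes=%d\n",
    $same_string, $char_length, $flag_on, $internal_length;
```

Functions that ignore the UTF-8 flag would see the four internal bytes of the upgraded copy rather than the original two, which is why the "-e" test above can fail.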

p5pRT commented Aug 21, 2013

From @b2gills

We should DEFINITELY document that filesystem names
should be treated as binary data.
I'm fairly certain we support at least one system
that can have two or more files with the same normalized name.

This can become problematic if you normalize a UTF8 text
file, and use a string from it to determine the name of another
file to read.

So readdir should only return strings with the UTF8 flag set
if the current operating system ALWAYS does some sort of
normalization of filenames.
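
To illustrate the normalization concern, here is a small sketch using the core Unicode::Normalize and Encode modules. The sample name "é" is an assumption chosen for illustration. Two names that are equal after normalization still have different byte sequences, so a byte-oriented filesystem can hold both:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);
use Encode qw(encode);

my $composed   = "\x{e9}";      # "é" as a single code point (NFC form)
my $decomposed = "e\x{301}";    # "e" plus combining acute accent (NFD form)

# Equal once normalized...
my $same_after_nfc = (NFC($decomposed) eq NFC($composed));

# ...but different as raw UTF-8 bytes.
my $bytes_composed   = unpack 'H*', encode('UTF-8', $composed);      # c3a9
my $bytes_decomposed = unpack 'H*', encode('UTF-8', $decomposed);    # 65cc81

print "normalized equal: ", ($same_after_nfc ? "yes" : "no"), "\n";
print "$bytes_composed vs $bytes_decomposed\n";
```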

p5pRT commented Aug 21, 2013

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Aug 21, 2013

From victor@vsespb.ru

We should DEFINITELY document that filesystem names should be treated
as binary data.

I think it's documented that it returns binary data​:

http​://perldoc.perl.org/perlunicode.html#When-Unicode-Does-Not-Happen

there are still many places where Unicode (in some encoding or
another) could be given as arguments or received as results, or both,
but it is not.

The problem is that it's hard to tell whether readdir() will always return
a downgraded string.

Who knows, maybe it sometimes sets (or will set in the future) the UTF-8 flag
for some strings. The UTF-8 flag does not mean "Unicode"; the string can
be binary data too.

The only things documented about readdir() are: 1) it does not return
"Unicode"; 2) its output is compatible with open()/-X input.

So if we imagine that readdir set the UTF-8 flag for the filename "\xC2\xB5",
it would not violate the two statements above, but the result would be an
upgraded binary string.
It's hard to tell whether utf8::downgrade on filenames is safe or not.
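
One way to check whether a downgrade is possible without risking an exception is the FAIL_OK argument of utf8::downgrade. A minimal sketch, with sample strings that are assumptions for illustration:

```perl
use strict;
use warnings;

# Upgraded binary data: every character is below 0x100, so downgrade succeeds.
my $upgraded_bytes = "\xC2\xB5";
utf8::upgrade($upgraded_bytes);
my $ok_bytes = utf8::downgrade($upgraded_bytes, 1);    # true

# A character string with a code point above 0xFF cannot be downgraded.
# With FAIL_OK set, utf8::downgrade returns false instead of dying.
my $wide    = "\x{100}";
my $ok_wide = utf8::downgrade($wide, 1);               # false, $wide unchanged

print "bytes: ", ($ok_bytes ? "downgraded" : "failed"), "\n";
print "wide:  ", ($ok_wide  ? "downgraded" : "failed"), "\n";
```

This only shows that the operation is mechanically reversible for strings without wide characters. It does not settle whether readdir() itself guarantees a downgraded result.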

This can become problematic if you normalize a UTF8 text file

I think that's something different. Do you mean Unicode NFC/NFD
normalization? Then it's probably unrelated to Perl:
a) Perl does not silently normalize strings (afaik); b) it's quite
obvious that byte-oriented filesystems (like most on Linux) don't
normalize Unicode.

p5pRT commented Aug 21, 2013

From @Leont

On Wed, Aug 21, 2013 at 12​:07 AM, Victor Efimov
<perlbug-followup@​perl.org>wrote​:

readdir() return value should be documented as always downgraded. otherwise
there is an inconsistency in file functions workflow.

I think this "always downgraded" concept is misunderstanding how Unicode
works in Perl. Downgrading means "from utf8/utf-ebcdic to latin-1/ebcdic".
What you probably mean is «readdir() should be documented as always
returning a binary string».

2. binary data can be (randomly) upgraded or downgraded by 3rd party code

(it's hard for programmer to control this)

If you want to treat something as a textual string, then all your code
needs to agree on that. You will need to correctly decode (/upgrade) it on
input, and correctly encode (/downgrade) it on output.
If you want to treat something as a binary string, then all your code needs
to agree on that. You need to do absolutely no encoding or decoding.

Anything else is madness.

If 3rd party code is upgrading your binary data, then you're passing binary
data to a function that is expecting textual data. That mistake is in no
way specific to filenames.

3. Functions working with filenames ( here is list
http​://perldoc.perl.org/perlunicode.html#When-Unicode-Does-Not-Happen )
simply ignore UTF-8 flag

Correct.

4. Programmer might want to work with filenames as with binary strings (not

character strings) because filesystem encoding unknown/hard to detect.

Worse, filesystem encoding is generally non-portable. On Windows and Mac,
its encoding is perfectly predictable, on most other operating systems not
at all though you can usually make educated guesses.

solution​: programmer should explicitly call utf8​::downgrade($_) before "-e"
test​:

No, he should not! He should either encode or downgrade, depending on
whether he wants his filename to be utf-8 or latin-1 encoded. The right
answer is rarely downgrading IME.
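
The difference between the two operations can be seen on a single character. A sketch, assuming the character "é" (U+00E9) as the file name:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{e9}";     # the character "é"
utf8::upgrade($char);    # force the upgraded internal form

# Downgrading yields the single Latin-1 byte.
my $latin1 = $char;
utf8::downgrade($latin1);
my $latin1_hex = unpack 'H*', $latin1;                 # e9

# Encoding yields the two-byte UTF-8 sequence.
my $utf8_hex = unpack 'H*', encode('UTF-8', $char);    # c3a9

print "downgrade: $latin1_hex, encode: $utf8_hex\n";
```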

    opendir(my $dh, '.');
    for (readdir($dh)) {
        $ARGV[0] ? utf8::upgrade($_) : utf8::downgrade($_);
        utf8::downgrade($_);
        die $_ unless -e $_;
    }

Utf-8 encoded data will roundtrip an upgrade/downgrade, but the
intermediate result will be mojibake. This is usually the wrong thing to
do, and certainly the wrong thing to recommend.

Leon

p5pRT commented Aug 21, 2013

From victor@vsespb.ru

No, he should not! He should either encode or downgrade, depending on
whether he wants his filename to be utf-8 or latin-1 encoded. The right
answer is rarely downgrading IME.

No, the programmer decided to deal with binary filenames (see assumption 4
in the first post), so he doesn't know the encoding.

I think this "always downgraded" concept is misunderstanding how
Unicode works in Perl

I was talking about binary strings. For example, Digest::SHA::sha1_hex
accepts a binary string and always downgrades it (because a binary string can
be upgraded unexpectedly).

If you want to treat something as a textual string, then all your code
needs to agree on that

No, it is a binary string.

If you want to treat something as a binary string, then all your code
needs to agree on that. You need to do absolutely no encoding or decoding.

Yes. I don't do any decode()/encode() of binary strings.

If 3rd party code is upgrading your binary data, then you're passing
binary data to a function that is expecting textual data.

In this example the code concatenates $s4 (binary data) with $s3 (ASCII
data), and the result is upgraded binary data:

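    my $s1 = chr(0x100);                 # character string (UTF-8 flag on)
    my $s2 = "ABC $s1";
    my ($s3, undef) = split ' ', $s2;    # $s3 is plain ASCII but keeps the flag
    die unless $s3 eq 'ABC';
    my $s4 = "\xf1\xf2";                 # binary data (flag off)
    my $s5 = $s4.$s3;                    # concatenation upgrades the result
    print utf8::is_utf8($s5);            # true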

Another example where ASCII data can get the UTF-8 flag:

    use encoding "utf8";
    my $x = "x";
    print utf8::is_utf8($x);

The variable $x will upgrade a binary string when concatenated with it, and it's
hard to argue that the programmer intended $x to be textual-only
data: it is just ASCII, so it could equally be binary.

The point is that the programmer cannot control whether his binary data gets
upgraded. That is probably why syswrite(), print() (at least with the :raw layer),
Digest::SHA, MIME::Base64 and other functions that work with binary
data always downgrade it.
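
A short sketch of that behavior with the core Digest::SHA and MIME::Base64 modules. The sample bytes are an assumption for illustration:

```perl
use strict;
use warnings;
use Digest::SHA  qw(sha1_hex);
use MIME::Base64 qw(encode_base64);

my $down = "\xf1\xf2";    # binary data, downgraded
my $up   = $down;
utf8::upgrade($up);       # same data, upgraded representation

# Both modules produce the same result for either representation.
my $same_digest = (sha1_hex($up)      eq sha1_hex($down));
my $same_base64 = (encode_base64($up) eq encode_base64($down));

# A character above 0xFF cannot be downgraded, so the call dies.
my $died = !eval { sha1_hex("\x{100}"); 1 };

print "same digest: ", ($same_digest ? "yes" : "no"), "\n";
print "same base64: ", ($same_base64 ? "yes" : "no"), "\n";
print "wide input dies: ", ($died ? "yes" : "no"), "\n";
```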

p5pRT commented Aug 21, 2013

From @Leont

On Wed, Aug 21, 2013 at 11​:19 AM, Victor Efimov via RT <
perlbug-followup@​perl.org> wrote​:

No, he should not! He should either encode or downgrade, depending on
whether he wants his filename to be utf-8 or latin-1 encoded. The right
answer is rarely downgrading IME.
no, programmer decided to deal with binary filenames (see assumptions(4)
in 1st post), so he don't know the encoding.

I just explained that choice in my previous message… You're completely
missing my point on encode versus downgrade here.

I think this "always downgraded" concept is misunderstanding how
Unicode works in Perl
I was talking about binary strings. for example Digest​::SHA​::sha1_hex
accepts binary string and always downgrade it (because binary string can
be suddenly upgraded).

That's silly and unfortunate. If one gets invalid input, one should treat
it as an error, not make wild guesses. Especially when it does throw an
exception if that string contains a character > 0xFF.

If you want to treat something as a textual string, then all your code
needs to agree on that
no, binary string

Did you notice the word "if"?

point is programmer cannot control if his binary data upgraded or no.

Bullshit. It's more difficult than it should be, but he can absolutely
control it.

that is probably why syswrite(), print() (at least with :raw layer),

PerlIO can explicitly switch both ways. And no, it doesn't always
downgrade, sometimes it just gives up and passes on utf-8 (while throwing a
"Wide character" warning) because it can't do any better. The current
behavior has a high degree of bolted-onness.
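
The "Wide character" case can be reproduced with a small sketch. The in-memory file handle is an assumed stand-in for a real file opened without an encoding layer:

```perl
use strict;
use warnings;

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# In-memory file handle with no encoding layer.
open(my $fh, '>', \my $buffer) or die "open failed: $!";
print {$fh} "\x{100}";    # character above 0xFF: cannot be downgraded
close($fh);

my $hex = unpack 'H*', $buffer;    # the bytes that were actually written
print "written: $hex\n";
print "warning: $warnings[0]" if @warnings;
```

Here print cannot downgrade the string, so it writes the UTF-8 bytes and warns instead.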

Digest​::SHA, MIME​::Base64 and other functions, that work with binary

data, always downgrade it.

Neither of those have any excuse to do so IMHO.

Leon

p5pRT commented Aug 21, 2013

From victor@vsespb.ru

Did you notice the word "if"?

My point was that I am not talking about textual data, only about binary data.

it does throw an exception if that string contains a character > 0xFF

And no, it doesn't always downgrade, sometimes it just gives up and
passes on utf-8 (while throwing a "Wide character" warning

No, upgraded _binary_ data never raises "Wide character" errors/warnings.

Neither of those have any excuse to do so IMHO.

syswrite with the :raw layer does so too.

p5pRT commented Aug 25, 2013

From @ap

* Leon Timmermans <fawaka@​gmail.com> [2013-08-21 10​:35]​:

Downgrading means "from utf8/utf-ebcdic to latin-1/ebcdic".

Uh, exactly where did you get that idea?

Downgrading means changing the string’s internal representation from the
variable-width multibyte format to the constant-width singlebyte format.
The apparent encoding of the string’s data should be entirely unaffected
by this.

What you probably mean is «readdir() should be documented as always
returning a binary string».

Sure. But the unfortunate fact is that open() and friends still suffer
from The Unicode Bug, and as long as that remains the case then it would
be helpful to also assert that the return value of readdir() is always
downgraded if the OS filesystem API itself does not deal in characters.

If you want to treat something as a textual string, then all your code
needs to agree on that. You will need to correctly decode (/upgrade)
it on input, and correctly encode (/downgrade) it on output. If you
want to treat something as a binary string, then all your code needs
to agree on that. You need to do absolutely no encoding or decoding.

Anything else is madness.

If 3rd party code is upgrading your binary data, then you're passing
binary data to a function that is expecting textual data. That mistake
is in no way specific to filenames.

You are confusing upgrade/downgrade with decode/encode. You are correct
about decode/encode. You are wrong about upgrade/downgrade.

Worse, filesystem encoding is generally non-portable. On Windows and
Mac, its encoding is perfectly predictable, on most other operating
systems not at all though you can usually make educated guesses.

That really doesn’t matter to the issue in question.

solution​: programmer should explicitly call utf8​::downgrade($_)
before "-e" test​:

No, he should not! He should either encode or downgrade, depending on
whether he wants his filename to be utf-8 or latin-1 encoded. The right
answer is rarely downgrading IME.

That hinges on whether readdir() returns UTF8-encoded filenames decoded
or not. If it doesn’t decode them, and the string was upgraded, it will
be semantically correct, but because open() has The Unicode Bug, it will
treat the filename as double-encoded. So assuming readdir() and friends
do the right thing, then downgrading is both safe and necessary.
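
Under that assumption, a defensive wrapper might look like the following sketch. The helper name filename_bytes is hypothetical. It relies only on utf8::downgrade with its FAIL_OK argument:

```perl
use strict;
use warnings;

# Hypothetical helper: return a downgraded copy of a file name so that
# flag-ignoring functions such as open() or -e see the original bytes.
sub filename_bytes {
    my ($name) = @_;    # work on a copy, leave the caller's string alone
    utf8::downgrade($name, 1)
        or die "filename contains characters above 0xFF\n";
    return $name;
}

my $from_readdir = "\xC2\xB5";    # assumed bytes returned by readdir()
my $upgraded     = $from_readdir;
utf8::upgrade($upgraded);         # e.g. after concatenation elsewhere

my $safe = filename_bytes($upgraded);
print "flag cleared\n" unless utf8::is_utf8($safe);

# Typical use (not executed here):
#   die "missing: $safe" unless -e filename_bytes($upgraded);
```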

That is, currently.

I hope we can eventually fix open() et al and put this issue behind us.

Utf-8 encoded data will roundtrip an upgrade/downgrade, but the
intermediate result will be mojibake.

Only if the UTF8 flag is not respected. Which open() doesn’t. Otherwise
upgrade/downgrade would be invisible.

This is usually the wrong thing to do, and certainly the wrong thing
to recommend.

On the contrary, *if* readdir() does the right thing, then it is not just the
right thing to do, but also the right thing to recommend…

Regards,
--
Aristotle Pagaltzis // <http​://plasmasturm.org/>

p5pRT commented Aug 27, 2013

From @Leont

On Sun, Aug 25, 2013 at 4​:57 AM, Aristotle Pagaltzis <pagaltzis@​gmx.de>wrote​:

* Leon Timmermans <fawaka@​gmail.com> [2013-08-21 10​:35]​:

Downgrading means "from utf8/utf-ebcdic to latin-1/ebcdic".

Uh, exactly where did you get that idea?

Downgrading means changing the string’s internal representation from the
variable-width multibyte format to the constant-width singlebyte format.

I could have phrased it better (in particular mention internal
representation), but I don't think we're disagreeing here.

The apparent encoding of the string’s data should be entirely unaffected
by this.

Not sure what you mean with that, given that the bug is that there is
anything apparent at all about the internal encoding.

Sure. But the unfortunate fact is that open() and friends still suffer
from The Unicode Bug, and as long as that remains the case then it would
be helpful to also assert that the return value of readdir() is always
downgraded if the OS filesystem API itself does not deal in characters.

I'm not sure what you want; it currently already always returns bytes.
Downgrading would be a no-op.

If you want to treat something as a textual string, then all your code
needs to agree on that. You will need to correctly decode (/upgrade)
it on input, and correctly encode (/downgrade) it on output. If you
want to treat something as a binary string, then all your code needs
to agree on that. You need to do absolutely no encoding or decoding.

Anything else is madness.

If 3rd party code is upgrading your binary data, then you're passing
binary data to a function that is expecting textual data. That mistake
is in no way specific to filenames.

You are confusing upgrade/downgrade with decode/encode. You are correct
about decode/encode. You are wrong about upgrade/downgrade.

    upgrade($foo)      eq (is_utf8($foo) ? $foo : decode('latin-1', $foo));
    downgrade($foo, 0) eq encode('latin-1', $foo);

Downgrade is nothing more or less than an efficient way to encode to
latin-1 (or ebcdic). You could argue it expresses a different intent, but
that doesn't change the result.
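
The two equivalences above are shorthand rather than runnable code, since utf8::upgrade and utf8::downgrade modify their argument in place. A runnable sketch of the same comparison, assuming a sample string with one non-ASCII character and using the encoding name 'iso-8859-1' for Latin-1:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $foo = "caf\xE9";    # assumed sample: byte string, flag off

# Upgrade versus decode from Latin-1.
my $upgraded = $foo;
utf8::upgrade($upgraded);
my $decoded = decode('iso-8859-1', $foo);

# Downgrade versus encode to Latin-1.
my $downgraded = $upgraded;
utf8::downgrade($downgraded);
my $encoded = encode('iso-8859-1', $upgraded);

my $same_up   = ($upgraded   eq $decoded) && utf8::is_utf8($decoded);
my $same_down = ($downgraded eq $encoded) && !utf8::is_utf8($encoded);

print "upgrade matches decode: ",   ($same_up   ? "yes" : "no"), "\n";
print "downgrade matches encode: ", ($same_down ? "yes" : "no"), "\n";
```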

Worse, filesystem encoding is generally non-portable. On Windows and
Mac, its encoding is perfectly predictable, on most other operating
systems not at all though you can usually make educated guesses.

That really doesn’t matter to the issue in question.

It does if we ever decide to support Win32's unicode filename APIs (we
currently don't, which is why we can't open files containing non-latin1
names on Windows; this is a rather serious bug if you ask me). Even if we'd
expose that as octets, downgrading a character string would give a bogus
result. Exposing that as characters would make far more sense (and would
align more with what the legacy API already does underneath now).

No, he should not! He should either encode or downgrade, depending on

whether he wants his filename to be utf-8 or latin-1 encoded. The right
answer is rarely downgrading IME.

That hinges on whether readdir() returns UTF8-encoded filenames decoded
or not. If it doesn’t decode them, and the string was upgraded, it will
be semantically correct,

It would not be semantically correct​: it would already be mojibake, even if
that can be fixed by a downgrade.

but because open() has The Unicode Bug, it will
treat the filename as double-encoded.

Because it is double-encoded!

So assuming readdir() and friends
do the right thing, then downgrading is both safe and necessary.

That is, currently.

It's currently safe and completely unnecessary.

I hope we can eventually fix open() et al and put this issue behind us.

Utf-8 encoded data will roundtrip an upgrade/downgrade, but the
intermediate result will be mojibake.

Only if the UTF8 flag is not respected. Which open() doesn’t. Otherwise
upgrade/downgrade would be invisible.

I think this concept of "UTF8 flag is not respected" is nonsensical in this
context. On the contact points to the outside world, this idea that
"encoding doesn't matter, only the logical content does" breaks down.
That's why we're explicit about these things in IO layers for example. I
don't think there is a right way to handle this implicitly.

Leon

p5pRT commented Aug 27, 2013

From victor@vsespb.ru

It seems we disagree only about whether binary data can be upgraded by
accident or not.

So, here is the code. The programmer concatenates binary data (a filename) with
text data (actually known to be ASCII).

    my $binary_filename = "\xf1\xf2";
    my ($user_id, $user_type, undef) = split ' ', "123 ABC \x{100}";
    die unless $user_id eq '123';
    die unless $user_type eq 'ABC';
    $binary_filename .= $user_id;
    $binary_filename .= $user_type;

    print "Yes\n" if utf8::is_utf8($binary_filename);

I think the fix for this case is to downgrade the binary string before use (I
believe @Aristotle thinks so too):

===========
    my $binary_filename = "\xf1\xf2";
    my ($user_id, $user_type, undef) = split ' ', "123 ABC \x{100}";
    die unless $user_id eq '123';
    die unless $user_type eq 'ABC';
    $binary_filename .= $user_id;
    $binary_filename .= $user_type;

    utf8::downgrade($binary_filename);

    print "Yes\n" if utf8::is_utf8($binary_filename);

And I believe @Leon suggests that the fix is to encode textual data every time
before concatenating it with binary strings:

    use Encode;

    my $binary_filename = "\xf1\xf2";
    my ($user_id, $user_type, undef) = split ' ', "123 ABC \x{100}";
    die unless $user_id eq '123';
    die unless $user_type eq 'ABC';

    $user_id = encode("ascii", "$user_id");
    $binary_filename .= $user_id;

    $user_type = encode("ascii", "$user_type");
    $binary_filename .= $user_type;

    print "Yes\n" if utf8::is_utf8($binary_filename);

===========

My arguments for downgrading binary data:

1. Downgraded binary data is identical to upgraded binary data (Perl's 'eq'
operator, hash keys, print, syswrite on :raw: everything!).
2. These are actually filenames, binary filenames, so they are processed like
text (in File::Spec etc.).
3. Sometimes the programmer knows exactly what the textual data will look like.
It can be 100% numbers or plain ASCII strings. It should be safe to merge
a number or ASCII string with binary data.
4. From the JSON::XS docs: "You can have Unicode strings with that flag set,
with that flag clear, and you can have binary data with that flag set and
that flag clear"
5. Digest::SHA, MIME::Base64 and JSON::XS downgrade binary data before use,
just like Perl. Other authors might want to act like Perl too. Every "Wide
character in subroutine entry" message comes from an attempt to downgrade
binary data.
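
Argument 1 can be checked with a short sketch. The sample bytes and variable names are assumptions for illustration:

```perl
use strict;
use warnings;

my $down = "\xf1\xf2";    # binary data, downgraded
my $up   = $down;
utf8::upgrade($up);       # same data, upgraded representation

my %seen = ($down => 1);

my $eq_same  = ($up eq $down);                    # comparison ignores the representation
my $key_same = exists $seen{$up};                 # so does hash key lookup
my $len_same = (length($up) == length($down));    # and length()

print "eq: ",       ($eq_same  ? "same" : "different"), "\n";
print "hash key: ", ($key_same ? "same" : "different"), "\n";
print "length: ",   ($len_same ? "same" : "different"), "\n";
```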

2013/8/28 Leon Timmermans <fawaka@​gmail.com>

On Sun, Aug 25, 2013 at 4​:57 AM, Aristotle Pagaltzis <pagaltzis@​gmx.de>wrote​:

* Leon Timmermans <fawaka@​gmail.com> [2013-08-21 10​:35]​:

Downgrading means "from utf8/utf-ebcdic to latin-1/ebcdic".

Uh, exactly where did you get that idea?

Downgrading means changing the string’s internal representation from the
variable-width multibyte format to the constant-width singlebyte format.

I could have phrased it better (in particular mention internal
representation), but I don't think we're disagreeing here.

The apparent encoding of the string’s data should be entirely unaffected
by this.

Not sure what you mean with that, given that the bug is that there is
anything apparent at all about the internal encoding.

Sure. But the unfortunate fact is that open() and friends still suffer
from The Unicode Bug, and as long as that remains the case then it would
be helpful to also assert that the return value of readdir() is always
downgraded if the OS filesystem API itself does not deal in characters.

I'm not sure what you want; it currently already always returns bytes.
Downgrading would be a no-op.

If you want to treat something as a textual string, then all your code
needs to agree on that. You will need to correctly decode (/upgrade)
it on input, and correctly encode (/downgrade) it on output. If you
want to treat something as a binary string, then all your code needs
to agree on that. You need to do absolutely no encoding or decoding.

Anything else is madness.

If 3rd party code is upgrading your binary data, then you're passing
binary data to a function that is expecting textual data. That mistake
is in no way specific to filenames.

You are confusing upgrade/downgrade with decode/encode. You are correct
about decode/encode. You are wrong about upgrade/downgrade.

upgrade($foo) eq (is_utf8($foo) ? $foo : decode('latin-1', $foo));
downgrade($foo, 0) eq encode('latin-1', $foo)

Downgrade is nothing more or less than an efficient way to encode to
latin-1 (or ebcdic). You could argue it expresses a different intent, but
that doesn't change the result.

Worse, filesystem encoding is generally non-portable. On Windows and
Mac, its encoding is perfectly predictable, on most other operating
systems not at all though you can usually make educated guesses.

That really doesn’t matter to the issue in question.

It does if we ever decide to support Win32's unicode filename APIs (we
currently don't, which is why we can't open files containing non-latin1
names on Windows; this is a rather serious bug if you ask me). Even if we'd
expose that as octets, downgrading a character string would give a bogus
result. Exposing that as characters would make far more sense (and would
align more with what the legacy API already does underneath now).

No, he should not! He should either encode or downgrade, depending on

whether he wants his filename to be utf-8 or latin-1 encoded. The right
answer is rarely downgrading IME.

That hinges on whether readdir() returns UTF8-encoded filenames decoded
or not. If it doesn’t decode them, and the string was upgraded, it will
be semantically correct,

It would not be semantically correct​: it would already be mojibake, even
if that can be fixed by a downgrade.

but because open() has The Unicode Bug, it will
treat the filename as double-encoded.

Because it is double-encoded!

So assuming readdir() and friends
do the right thing, then downgrading is both safe and necessary.

That is, currently.

It's currently safe and completely unnecessary.

I hope we can eventually fix open() et al and put this issue behind us.

Utf-8 encoded data will roundtrip an upgrade/downgrade, but the
intermediate result will be mojibake.

Only if the UTF8 flag is not respected. Which open() doesn’t. Otherwise
upgrade/downgrade would be invisible.

I think this concept of "UTF8 flag is not respected" is nonsensical in
this context. At the contact points with the outside world, the idea that
"encoding doesn't matter, only the logical content does" breaks down.
That's why we're explicit about these things in IO layers, for example. I
don't think there is a right way to handle this implicitly.

Leon

@p5pRT

p5pRT commented Aug 28, 2013

From @Leont

On Tue, Aug 27, 2013 at 11​:59 PM, Victor Efimov <victor@​vsespb.ru> wrote​:

It seems we disagree only about whether binary data can be upgraded by
accident or not.

So, here is the code. The programmer concatenates binary data (a filename)
with text data (actually known to be ASCII).

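my $binary_filename = "\xf1\xf2";   # two octets, stored downgraded
# Fields split from an upgraded string are upgraded too, even if pure ASCII.
my ($user_id, $user_type, undef) = split ' ', "123 ABC \x{100}";
die unless $user_id eq '123';
die unless $user_type eq 'ABC';
$binary_filename .= $user_id;       # concatenation implicitly upgrades the filename
$binary_filename .= $user_type;

print "Yes\n" if utf8::is_utf8($binary_filename);   # prints "Yes"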

I think that the fix for this case is downgrading the binary string before
use (I believe @Aristotle thinks so too):

===========
my $binary_filename = "\xf1\xf2";
my ($user_id, $user_type, undef) = split ' ', "123 ABC \x{100}";
die unless $user_id eq '123';
die unless $user_type eq 'ABC';
$binary_filename .= $user_id;
$binary_filename .= $user_type;

# Restore the downgraded storage before the filename is used.
utf8::downgrade($binary_filename);

print "Yes\n" if utf8::is_utf8($binary_filename);   # prints nothing

And I believe @Leon suggests that the fix is to encode textual data every
time, before concatenating it with binary strings:

use Encode;

my $binary_filename = "\xf1\xf2";
my ($user_id, $user_type, undef) = split ' ', "123 ABC \x{100}";
die unless $user_id eq '123';
die unless $user_type eq 'ABC';

# Encode each text field to octets so the concatenation never upgrades.
$user_id = encode("ascii", "$user_id");
$binary_filename .= $user_id;

$user_type = encode("ascii", "$user_type");
$binary_filename .= $user_type;

print "Yes\n" if utf8::is_utf8($binary_filename);   # prints nothing

===========

No, I believe the fix is either of​:
1. decode/upgrade $binary_filename on input, and encode/downgrade it at
output (thus the data is always treated as text).
2. encode/downgrade the $user_id/$user_type before concatenating it (thus
the data is always treated as binary)

In particular, I prefer preventing implicit upgrades, as I find them
unreliable.

Your approach works in this case because your explicit downgrade is matched
by an implicit upgrade. It's similar to option 1, but it stops working as
soon as (non-latin1) decoded data comes into play.
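
A minimal sketch of where the explicit downgrade stops working, using only the core `utf8::` functions. The second argument to `utf8::downgrade` is its documented FAIL_OK flag; the variable names are illustrative.

```perl
use strict;
use warnings;

# Only code points <= 0xFF: the upgrade can be undone.
my $latin1_only = "\xf1\xf2";
utf8::upgrade($latin1_only);
# FAIL_OK = 1: return false instead of dying when a downgrade is impossible.
my $ok_latin1 = utf8::downgrade($latin1_only, 1) ? 1 : 0;

# Once decoded text with a code point above 0xFF has been mixed in,
# there is no downgraded form to go back to.
my $wide    = "\xf1\xf2" . "\x{100}";
my $ok_wide = utf8::downgrade($wide, 1) ? 1 : 0;

# Without FAIL_OK the same call dies.
my $died  = eval { utf8::downgrade($wide); 1 } ? 0 : 1;
my $error = $died ? $@ : '';

print "latin-1 only: $ok_latin1, wide: $ok_wide, died: $died\n";
print "error: $error" if $died;
```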

My arguments for downgrading binary data​:

In discussing these things, sometimes some words mean different things to
different people. To me «upgraded binary data» is a contradictio in
terminis.

1. downgraded binary data is identical to upgraded (perl 'eq' operator,
hash keys, print, syswrite on :raw - everything!)

That doesn't necessarily make it sensible to do.

2. it's actually filenames, binary filenames. so processed like text (in
File​::Spec etc).

That makes no sense to me, care to explain?

3. sometimes programmer knows exactly what textual data will be like. it
can be 100% numbers or plain ASCII strings. it should be safe to merge
number or ASCII with binary data.

That is true either way.

4. from JSON​::XS docs​: "You can have Unicode strings with that flag set,
with that flag clear, and you can have binary data with that flag set and
that flag clear"

That's not exactly relevant to this discussion.

5. Digest​::SHA, MIME​::Base64, JSON​::XS downgrade binary data before use.
Just like perl. Other authors might want to act like perl too.

Both JSON​::XS and PerlIO allow you to explicitly switch between either
behavior, character string or byte string.

I think that sort of choice is often the best way to go forward. E.g.
«encode_base64(encode("UTF-8", "\x{FFFF}\n"));» could become
«encode_base64("\x{FFFF}\n", "UTF-8");»

Every "Wide character in subroutine entry" message comes from an attempt to
downgrade binary data

No, it comes from downgrading data, not necessarily 'binary' data.

Leon

@p5pRT

p5pRT commented Aug 28, 2013

From victor@vsespb.ru

No, I believe the fix is either of​:
1. decode/upgrade $binary_filename on input, and encode/downgrade it at
output (thus the data is always treated as text).
2. encode/downgrade the $user_id/$user_type before concatenating it (thus
the data is always treated as binary)

(1) This ticket was only about the case where the programmer doesn't know
the encoding, and so works with filenames as "binary strings" (not
character strings). He can't decode.

(2) So I was right when I said that
"And I believe @Leon suggests that the fix is to encode textual data every
time, before concatenating it with binary strings",
except that, yes, the programmer can just downgrade() the strings if he is
sure that they are ASCII.

Your approach works in this case because your explicit downgrade is
matched
by an implicit upgrade. It's similar to option 1, but it stops working as
soon as (non-latin1) decoded data comes into play.

No! I am talking only about filenames as "binary strings" - (that was
mentioned in the 1st post).
Binary data can be safely upgraded and downgraded any number of times​:

====
my $binary_data = "\xf1\xf2\xf3";
my $s = $binary_data;
# Usage: pass the number of upgrades and the number of downgrades as arguments.
utf8::upgrade($s) for (1..$ARGV[0]);
utf8::downgrade($s) for (1..$ARGV[1]);
die unless $s eq $binary_data;
print "Fine\n";

====

In discussing these things, sometimes some words mean different things to
different people. To me «upgraded binary data» is a contradictio in
terminis.

"binary" explained here​:

http​://perldoc.perl.org/perlunifaq.html#How-can-I-determine-if-a-string-is-a-text-string-or-a-binary-string?

"How can I determine if a string is a text string or a binary string?"
"You can't. "
...

"This is something you, the programmer, has to keep track of; sorry."

http​://perldoc.perl.org/perlunicode.html#When-Unicode-Does-Not-Happen

"The following are such interfaces. Also, see The Unicode Bug. For all
of these interfaces Perl currently (as of v5.16.0) simply
assumes byte strings both as arguments and results, "

So programmer can treat filenames as binary data. And work with it just
like with binary data.

That makes no sense to me, care to explain?

I meant that filenames are binary data, but often people have to work with
them as if they were text. That makes things complicated.


@p5pRT

p5pRT commented Aug 28, 2013

From @xdg

On Tue, Aug 20, 2013 at 6​:07 PM, Victor Efimov
<perlbug-followup@​perl.org> wrote​:

readdir() return value should be documented as always downgraded. otherwise
there is an inconsistency in file functions workflow.

I've really gotten lost in this thread. I think the original point of
documenting readdir() makes sense. We shouldn't bury the semantics
over in the Unicode documentation.

I think the remaining confusion is over what to recommend for
manipulating scalars that contain filenames.

Here, I think I more or less agree with Leon that preventing implicit
upgrades during concatenation (or other string operations) seems wise.

In other words, if you know you're dealing with binary data, make sure
any strings you operate against are also "binary" (i.e. encoded with
the internal UTF-8 flag off).

Thus it seems like the recommendation should be​:

(1) If you know the encoding of a name read from readdir(), decode it
to characters, manipulate it how you want, and then encode it again
before using it to interact with the filesystem.

(2) If you *don't* know the encoding, make sure any strings you're
concatenating, etc. are ASCII with the UTF-8 flag turned off.
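
Both recommendations can be sketched roughly as follows. This assumes a Unix-like system on which the names in the directory are UTF-8 encoded; the temporary directory, the file name and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempdir);

# Assumption: a Unix-like system on which this directory's names are UTF-8.
my $dir     = tempdir(CLEANUP => 1);
my $created = encode('UTF-8', "r\x{e9}sum\x{e9}.txt");
open(my $fh, '>', "$dir/$created") or die "cannot create file: $!";
close($fh);

opendir(my $dh, $dir) or die "cannot open $dir: $!";
my @names = grep { !/\A\.\.?\z/ } readdir($dh);
closedir($dh);

# (1) Encoding known: decode, work with characters, encode before use.
my $hits_decoded = 0;
for my $octets (@names) {
    my $chars = decode('UTF-8', $octets);
    next unless $chars =~ /\.txt\z/;              # character-level test
    my $path = "$dir/" . encode('UTF-8', $chars);
    $hits_decoded++ if -e $path;
}

# (2) Encoding unknown: keep the octets and downgrade whatever is joined to them.
my ($prefix) = split /\n/, "$dir\n\x{100}";       # same characters as $dir, upgraded storage
utf8::downgrade($prefix);                         # succeeds: no code point above 0xFF
my $hits_octets = 0;
for my $octets (@names) {
    my $path = "$prefix/$octets";
    $hits_octets++ if !utf8::is_utf8($path) && -e $path;
}

print "decoded: $hits_decoded, octets: $hits_octets\n";
```

In (2) the directory prefix stands in for any string that picked up the UTF-8 flag somewhere along the way; it is downgraded before it touches the name returned by readdir().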

Have I misunderstood something?

David

--
David Golden <xdg@​xdg.me>
Take back your inbox! → http​://www.bunchmail.com/
Twitter/IRC​: @​xdg

@p5pRT

p5pRT commented Aug 28, 2013

From zefram@fysh.org

David Golden wrote​:

Here, I think I more or less agree with Leon that preventing implicit
upgrades during concatenation (or other string operations) seems wise.

I disagree. The internal representation should be as invisible as
possible. This means that using an upgraded string must not break any
string operation. Implicit upgrading is essential when the correct
result of a string operation includes any codepoint above 0xff.

It also means that operators that take a string operand and just use the
internal PV buffer are faulty. A Unix pathname is a string of octets with
values 0x01 to 0xff; in Perl, that's a string matching /\A[\x01-\xff]*\z/.
An operator such as opendir that takes a pathname as an operand should
(on Unix) accept the Perl strings that match that regexp, and give
each such string the semantics that its Perl-visible value implies,
regardless of how the string is represented internally. Since opendir
must pass on a NUL-terminated octet string to libc's opendir(3), that
means pp_opendir needs to internally downgrade its operand. It should
also reject an operand that doesn't match /\A[\x01-\xff]*\z/.
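
What that would mean can be approximated in user code. The helper below is hypothetical (it is not part of Perl and says nothing about how pp_opendir is actually written); it only sketches the check-then-downgrade behaviour described above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Perl: validate a pathname by its
# Perl-visible value and return a downgraded copy suitable for libc.
sub pathname_octets {
    my ($path) = @_;
    die "pathname contains NUL or a code point above 0xFF\n"
        unless $path =~ /\A[\x01-\xff]*\z/;
    utf8::downgrade($path);    # cannot fail after the check above
    return $path;
}

# An upgraded string with a valid value is accepted and downgraded.
my $upgraded = "\xf1\xf2";
utf8::upgrade($upgraded);
my $octets = pathname_octets($upgraded);

# Values that cannot be Unix pathnames are rejected.
my $rejects_wide = eval { pathname_octets("\x{100}"); 1 } ? 0 : 1;
my $rejects_nul  = eval { pathname_octets("a\0b");    1 } ? 0 : 1;

print "flag off: ", (utf8::is_utf8($octets) ? 0 : 1),
      ", rejects wide: $rejects_wide, rejects NUL: $rejects_nul\n";
```

The helper works on a copy, so the caller's string keeps whatever internal representation it had.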

-zefram

@p5pRT

p5pRT commented Aug 28, 2013

From victor@vsespb.ru

I've really gotten lost in this thread. I think the original point of
documenting readdir() makes sense. We shouldn't bury the semantics
over in the Unicode documentation.

I actually thought that it _could_ be placed at _least_ in the Unicode
documentation.

Currently what is documented is:

http​://perldoc.perl.org/perlunicode.html#When-Unicode-Does-Not-Happen

"For all of these interfaces Perl currently (as of v5.16.0) simply
assumes byte strings both as arguments and results"

i.e. it is documented as "byte strings", without a note that they are downgraded.

I think the remaining confusion is over what to recommend for
manipulating scalars that contain filenames.

Here, I think I more or less agree with Leon that preventing implicit
upgrades during concatenation (or other string operations) seems wise.

Yes, it's indeed wise. At least it's faster and saves memory. But is
this the only allowed approach?

Perl code (eq, print, syswrite) works fine with upgraded binary strings.
3rd party modules (Digest::SHA) do too.
So I assume people are allowed to upgrade strings by accident (because perl
will fix it when needed).

I think it's not documented that upgraded binary data is invalid.
upgrade is documented as a change of the internal representation of a string
which does not affect anything.

I also suspect there is existing code which works with binary
data and sometimes upgrades it (relying on the fact that it will be
downgraded by perl when needed).

Now, if we recommend never concatenating strings that have the UTF-8 flag
with filenames, it would look like:

"If you ever concatenate an ASCII string that has the UTF-8 bit with a
filename, the filename will be broken and cannot be reliably fixed with
utf8::downgrade,
because it's not guaranteed that the filename was initially downgraded".

After that it seems we'll have three categories of strings​:

1. character strings
2. binary strings (you can upgrade it - perl will downgrade for you)
3. filenames - never upgrade it!

previously we had only two categories (1,2)

So I suggest simply documenting that filenames are always downgraded, so
smart users can decide by themselves that they can downgrade them before
or after concatenation with ASCII strings that have the UTF-8 bit.

Have I misunderstood something?

No.


@p5pRT

p5pRT commented Aug 28, 2013

From zefram@fysh.org

Victor Efimov via RT wrote​:

http​://perldoc.perl.org/perlunicode.html#When-Unicode-Does-Not-Happen

"For all of these interfaces Perl currently (as of v5.16.0) simply
assumes byte strings both as arguments and results"

That sort of statement is rather ambiguous. From the now-recommended
point of view, a "byte string" is the same thing as a "string of characters
whose codepoints do not exceed 0xff", i.e., a string matching
/\A[\x00-\xff]*\z/, and it can be either upgraded or downgraded.
Limiting operations to such strings is one way to "not do Unicode".

But what that statement in perlunicode(1) really means is that these
interfaces use the internal PV without regard to the SvUTF8 flag that says
what the PV means. The sanest way to interpret it is that it's taking
"byte string" to mean "downgraded" string. The operators don't check the
SvUTF8 flag because they assume that you supplied a downgraded string,
and things go wrong when that assumption turns out to be false.

Actually that documentation isn't written to be interpreted that way.
It's just been written carelessly, with a loose usage of "assume".
We can improve it.
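
The difference between the Perl-visible value and the internal PV can be observed from Perl code. A small sketch, assuming an ASCII-based platform and using only the core `bytes` and `utf8::` functions:

```perl
use strict;
use warnings;
use bytes ();    # makes bytes::length available without enabling byte semantics

my $downgraded = "\xf1\xf2";
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);

# Perl-visible value: the two strings are identical.
my $equal = ($downgraded eq $upgraded) ? 1 : 0;
my $chars = length($upgraded);

# Internal buffer: its size differs between the two representations.
my $buf_down = bytes::length($downgraded);
my $buf_up   = bytes::length($upgraded);

# The octets an operator would pass on if it read the PV of the upgraded
# string without looking at the SvUTF8 flag.
my $seen = $upgraded;
utf8::encode($seen);

print "equal: $equal, chars: $chars, buffer: $buf_down vs $buf_up\n";
printf "octets seen: %s\n", join ' ', map { sprintf '%02x', ord } split //, $seen;
```

With the upgraded string, such an operator would look for a four-octet name rather than the two-octet one.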

-zefram

@p5pRT

p5pRT commented Aug 28, 2013

From @Leont

On Wed, Aug 28, 2013 at 9​:27 PM, Zefram <zefram@​fysh.org> wrote​:

I disagree. The internal representation should be as invisible as
possible. This means that using an upgraded string must not break any
string operation. Implicit upgrading is essential when the correct
result of a string operation includes any codepoint above 0xff.

I don't see the disagreement.

It also means that operators that take a string operand and just use the
internal PV buffer are faulty. A Unix pathname is a string of octets with
values 0x01 to 0xff; in Perl, that's a string matching /\A[\x01-\xff]*\z/.
An operator such as opendir that takes a pathname as an operand should
(on Unix) accept the Perl strings that match that regexp, and give
each such string the semantics that its Perl-visible value implies,
regardless of how the string is represented internally. Since opendir
must pass on a NUL-terminated octet string to libc's opendir(3), that
means pp_opendir needs to internally downgrade its operand. It should
also reject an operand that doesn't match /\A[\x01-\xff]*\z/.

That's an entirely different (but valid) discussion, orthogonal to this
ticket.

Leon

@p5pRT

p5pRT commented Aug 28, 2013

From @ikegami

On Wed, Aug 21, 2013 at 4​:32 AM, Leon Timmermans <fawaka@​gmail.com> wrote​:

I think this "always downgraded" concept is misunderstanding how Unicode
works in Perl. Downgrading means "from utf8/utf-ebcdic to latin-1/ebcdic".

It has nothing to do with UTF-8 or latin-1. (Both upgraded and downgraded
strings can hold any of those.) It means "use 8-bit chars instead of
variable-width chars to store the string".

What you probably mean is «readdir() should be documented as always
returning a binary string».

Indeed.

2. binary data can be (randomly) upgraded or downgraded by 3rd party code
(it's hard for programmer to control this)

If you want to treat something as a textual string, then all your code
needs to agree on that. You will need to correctly decode (/upgrade) it on
input, and correctly encode (/downgrade) it on output.

You will need to correctly decode it on input, and correctly encode it on
output.

(The only time you need to upgrade or downgrade is when you deal with buggy
code.)

3. Functions working with filenames ( here is list
http://perldoc.perl.org/perlunicode.html#When-Unicode-Does-Not-Happen )
simply ignore UTF-8 flag

Correct.

Unfortunately. This is one of the last instances of The Unicode Bug in core.

4. Programmer might want to work with filenames as with binary strings (not
character strings) because filesystem encoding unknown/hard to detect.

Worse, filesystem encoding is generally non-portable. On Windows and Mac,
its encoding is perfectly predictable, on most other operating systems not
at all though you can usually make educated guesses.

solution: programmer should explicitly call utf8::downgrade($_) before "-e"
test:

No, he should not! He should either encode or downgrade, depending on
whether he wants his filename to be utf-8 or latin-1 encoded. The right
answer is rarely downgrading IME.

But it is here. Upgrading and downgrading are the only way to get predictable
behaviour from functions suffering from The Unicode Bug.

Perl treats file names as bytes, and its operators expect these bytes to
be stored in strings using the downgraded string storage format.

On some systems, you can get away with passing upgraded Unicode code points.

Options​:

-e _d($file_name_bytes)
-e _u(decode_utf8($file_name_bytes)) # UTF-8 systems only.

sub _d { my ($s) = @_; utf8::downgrade($s); $s }
sub _u { my ($s) = @_; utf8::upgrade($s); $s }

(In that particular example, _u isn't actually needed because decode_utf8
"guarantees" it.)

- Eric

@p5pRT

p5pRT commented Aug 28, 2013

From @xdg

On Wed, Aug 28, 2013 at 3​:27 PM, Zefram <zefram@​fysh.org> wrote​:

I disagree. The internal representation should be as invisible as
possible. This means that using an upgraded string must not break any
string operation. Implicit upgrading is essential when the correct
result of a string operation includes any codepoint above 0xff.

It also means that operators that take a string operand and just use the
internal PV buffer are faulty.

If you flip those statements around, it's because we have faulty
operators that the internal representation can't (currently) be
invisible. (Thus, Leon's suggestion that users decode on read and
encode on use.)

I agree 100% that it would be great if it could be invisible, but I'm
not sure how without knowing something about what the filesystem
produces/accepts and hinting about the encoding.

  readdir( $dir_handle, ":utf8" ); # read and decode

For file *content* we have layers. For file *names* we make users
manage encodings themselves.
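
Until something like that exists, the same effect can be had with a user-level wrapper. The `readdir_decoded` function below is hypothetical (readdir() itself takes no encoding argument), and the example assumes a Unix-like system with UTF-8 file names.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempdir);

# Hypothetical helper: decode every name returned by readdir() with an
# encoding supplied by the caller.
sub readdir_decoded {
    my ($dh, $encoding) = @_;
    return map { decode($encoding, $_) } readdir($dh);
}

# Assumption: names in this directory are stored as UTF-8 octets.
my $dir    = tempdir(CLEANUP => 1);
my $octets = encode('UTF-8', "\x{436}.txt");
open(my $fh, '>', "$dir/$octets") or die "cannot create file: $!";
close($fh);

opendir(my $dh, $dir) or die "cannot open $dir: $!";
my @names = grep { !/\A\.\.?\z/ } readdir_decoded($dh, 'UTF-8');
closedir($dh);

# The decoded name compares equal to the character string we started from.
my $found = grep { $_ eq "\x{436}.txt" } @names;
print "decoded names found: $found\n";
```

The caller still has to encode the name again before handing it back to open(), -e and friends.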

David

--
David Golden <xdg@​xdg.me>
Take back your inbox! → http​://www.bunchmail.com/
Twitter/IRC​: @​xdg

@p5pRT

p5pRT commented Aug 28, 2013

From Mark@Overmeer.net

* David Golden (xdg@​xdg.me) [130828 22​:09]​:

On Wed, Aug 28, 2013 at 3​:27 PM, Zefram <zefram@​fysh.org> wrote​:
I agree 100% that it would be great if it could be invisible, but I'm
not sure how without knowing something about what the filesystem
produces/accepts and hinting about the encoding.

  readdir( $dir_handle, ":utf8" ); # read and decode

For file *content* we have layers. For file *names* we make users
manage encodings themselves.

Although that is very well possible, it is very inconvenient when the admin
has configured different character sets per mount point. So, probably one
global variable will do. You get all kinds of problems when the
intended character set of paths differs from the charset
in the environment of the user as well...

Therefore, a good default is probably the codeset in LC_CTYPE
  [language[_territory][.codeset][@​modifier]]
where "codeset" defaults to latin1 (POSIX says)
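
A rough sketch of picking that default from the environment, assuming a platform where the core I18N::Langinfo module works. The fallback to Latin-1 follows the suggestion above and is a choice made for this example, not something Perl does by itself.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(find_encoding);

# Adopt the user's locale, then ask for the codeset of LC_CTYPE.
setlocale(LC_CTYPE, '');
my $codeset = langinfo(CODESET);

# Fall back to Latin-1 when the codeset is empty or unknown to Encode.
my $enc = (defined $codeset && length $codeset) ? find_encoding($codeset) : undef;
$enc ||= find_encoding('iso-8859-1');

print "filename encoding guess: ", $enc->name, "\n";
```

The result is only a guess about file names: as noted above, the names on a given mount point may use a different character set than the user's environment.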
--
  MarkOv


  Mark Overmeer MSc MARKOV Solutions
  Mark@​Overmeer.net solutions@​overmeer.net
http​://Mark.Overmeer.net http​://solutions.overmeer.net

@p5pRT

p5pRT commented Aug 29, 2013

From victor@vsespb.ru

- this ticket is only about the case where the encoding is unknown and
filenames are binary strings
- even if character-string filenames are implemented, this ticket is still
valid, because binary-filename mode should be the default for old code

