
A portion of backrefs from s///eg aren't downgraded under use bytes pragma. #10094

p5pRT opened this issue Jan 19, 2010 · 22 comments

p5pRT commented Jan 19, 2010

Migrated from rt.perl.org#72208 (status was 'open')

Searchable as RT72208$

p5pRT commented Jan 19, 2010

From aklaswad@gmail.com

Created by aklaswad@gmail.com

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
use Test::More tests => 3;
my $str = "Axあいうえおxxかきくけこxxさしすせそx";
{
  use bytes;
  $str =~ s/x(.+?)x/
  ok( !Encode::is_utf8( $1 ) );
  /eg;
}

Expected: success for all tests (or, at least, failure for all tests).

Observed: only the first test succeeds; the rest fail.

1..3
ok 1
not ok 2
# Failed test at ./perlbug.pl line 11.
not ok 3
# Failed test at ./perlbug.pl line 11.
# Looks like you failed 2 tests of 3.

I also found that the behaviour on perl 5.8.8 is even stranger:
the result is the same as on 5.11.3, but when the leading character 'A'
is removed from $str, the script passes all tests on perl 5.8.8.
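
As a workaround sketch (editorial note, not part of the original report): rather than testing the internal UTF8 flag of $1, one can copy the capture and take an explicit octet view via Encode, avoiding "use bytes" entirely; this behaves the same however perl happens to store the scalar. The string literal below is only a stand-in for the reporter's Japanese text.

use strict;
use warnings;
use Encode ();

my $str = "Ax\x{3042}\x{3044}xx\x{304B}\x{304D}x";    # stand-in for the original data
$str =~ s{x(.+?)x}{
    my $octets = Encode::encode('UTF-8', $1);  # explicit octets, independent of the flag
    # ... work with $octets as bytes here ...
    ''                                         # replacement text
}eg;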

Perl Info

Flags:
   category=core
   severity=low

Site configuration information for perl 5.11.3:

Configured by asawada at Fri Dec 18 19:59:19 JST 2009.

Summary of my perl5 (revision 5 version 11 subversion 3) configuration:
 Commit id: 27bca3226281a592aed848b7e68ea50f27381dac
 Platform:
   osname=linux, osvers=2.6.24-19-server, archname=i686-linux
   uname='linux sawadaubuntu 2.6.24-19-server #1 smp wed aug 20
23:54:28 utc 2008 i686 gnulinux '
   config_args='-ds -e -Dprefix=/usr/local/perl-5.11.3 -Dusedevel'
   hint=recommended, useposix=true, d_sigaction=define
   useithreads=undef, usemultiplicity=undef
   useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
   use64bitint=undef, use64bitall=undef, uselongdouble=undef
   usemymalloc=n, bincompat5005=undef
 Compiler:
   cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector
-I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
   optimize='-O2',
   cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
   ccversion='', gccversion='4.2.4 (Ubuntu 4.2.4-1ubuntu4)', gccosandvers=''
   intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
   d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
   ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
   alignbytes=4, prototype=define
 Linker and Libraries:
   ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
   libpth=/usr/local/lib /lib /usr/lib
   libs=-lnsl -ldl -lm -lcrypt -lutil -lc
   perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
   libc=/lib/libc-2.7.so, so=so, useshrplib=false, libperl=libperl.a
   gnulibc_version='2.7'
 Dynamic Linking:
   dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
   cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib
-fstack-protector'

Locally applied patches:



@INC for perl 5.11.3:
   /usr/local/perl-5.11.3/lib/site_perl/5.11.3/i686-linux
   /usr/local/perl-5.11.3/lib/site_perl/5.11.3
   /usr/local/perl-5.11.3/lib/5.11.3/i686-linux
   /usr/local/perl-5.11.3/lib/5.11.3
   .


Environment for perl 5.11.3:
   HOME=/home/asawada
   LANG=ja_JP.UTF-8
   LANGUAGE (unset)
   LD_LIBRARY_PATH (unset)
   LOGDIR (unset)
   PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
   PERL_BADLANG (unset)
   SHELL=/bin/bash

p5pRT commented Jan 19, 2010

From zefram@fysh.org

Akira Sawada wrote​:

  ok( !Encode::is_utf8( $1 ) );

You shouldn't be testing the "utf8" flag. It's an internal implementation
detail, which should never have been exposed by a function named
"is_utf8". It is generally not a bug for a string to be upgraded or
downgraded when you weren't expecting it.

There are, unfortunately, some places where string upgrading matters, in
the sense that behaviour varies according to the "utf8" flag even without
you directly peeking at it. These situations are bugs. Please let us
know about ones that you run into.

-zefram
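
A minimal sketch (editorial) of Zefram's point that the flag is an implementation detail: two scalars that are equal as strings can carry different internal representations, so Encode::is_utf8() can answer differently for what is, at the language level, the same string.

use strict;
use warnings;
use Encode ();

my $x = "caf\xe9";      # stored as single octets
my $y = $x;
utf8::upgrade($y);      # changes only the internal representation

print $x eq $y ? "equal\n" : "different\n";   # equal
print Encode::is_utf8($x) ? 1 : 0, "\n";      # 0
print Encode::is_utf8($y) ? 1 : 0, "\n";      # 1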

p5pRT commented Jan 19, 2010

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Jan 19, 2010

From mark@exonetric.com

On 19 Jan 2010, at 20​:34, Zefram wrote​:

There are, unfortunately, some places where string upgrading matters, in
the sense that behaviour varies according to the "utf8" flag even without
you directly peeking at it. These situations are bugs. Please let us
know about ones that you run into.

I wonder about the case of concatenating a decoded string ("utf8 on") with an
encoded string ("utf8 off"), where the resulting string ends up considered
decoded ("utf8 on"). This means that encoding the result again double-encodes
parts of the string.

This must be deliberate behaviour rather than a bug, but I'm curious what
the rationale here is. We ran into this quite recently and it took a bit of
effort to track down, as I hadn't even appreciated that there was a variation
in the internal representation.

I almost wonder if there shouldn't be more default warnings when this happens.

Pointers to archive postings on this point appreciated.

- Mark

p5pRT commented Jan 19, 2010

From zefram@fysh.org

Mark Blackman wrote​:

I wonder about the case of concatenating a decoded string ("utf8 on") with an
encoded string ("utf8 off")

I think what you mean by "decoded" and "encoded" here is that you have
(a) a string of characters and (b) a string of octets which represent
characters by means of the UTF-8 encoding. Concatenating these two is
a meaningless operation, and you shouldn't expect any sensible result.
What actually happens is that, for historical reasons, Perl doesn't
distinguish between octets and low-codepoint characters, so effectively
the octets get reinterpreted as Latin-1 characters. The "double encoding"
that you refer to is what happens if those bogus Latin-1 characters get
encoded as octets, typically using UTF-8.

The "utf8" flag has very little to do with the character encoding
operations that you actually want to think about.

-zefram
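
A minimal sketch (editorial) of the double-encoding trap described above, assuming the second string is UTF-8 octets that were never decode()d (e.g. pulled straight from a database):

use strict;
use warnings;
use Encode qw(decode encode);

my $chars  = decode('UTF-8', "caf\xc3\xa9");   # character string "café"
my $octets = "na\xc3\xafve";                   # still raw UTF-8 octets for "naïve"

# Concatenation treats the octets as Latin-1 characters...
my $mixed = $chars . $octets;

# ...so encoding the result UTF-8-encodes those already-encoded octets a second time.
print encode('UTF-8', $mixed), "\n";   # "café" is fine; "naïve" comes out double-encoded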

p5pRT commented Jan 19, 2010

From mark@exonetric.com

On 19 Jan 2010, at 21​:17, Zefram wrote​:

Mark Blackman wrote​:

I wonder about the case of concatenating a decoded string ("utf8 on") with an
encoded string ("utf8 off")

I think what you mean by "decoded" and "encoded" here is that you have
(a) a string of characters and (b) a string of octets which represent
characters by means of the UTF-8 encoding. Concatenating these two is
a meaningless operation, and you shouldn't expect any sensible result.

Indeed I should not, but this sort of thing does happen rather invisibly
when you fill templates with values pulled from a database.

I'm wondering if there's any moral support for warning about these cases
explicitly by default, like

"warning​: concatenating octet and character oriented strings, double
encoding may result" at line N.

Perhaps my situation is too rare to address with changes to core.
I didn't realise this was happening until I did some rather speculative
research though.

What actually happens is that, for historical reasons, Perl doesn't
distinguish between octets and low-codepoint characters, so effectively
the octets get reinterpreted as Latin-1 characters. The "double encoding"
that you refer to is what happens if those bogus Latin-1 characters get
encoded as octets, typically using UTF-8.

Yes, this is what I had deduced, with the help of this article:

http://ahinea.com/en/tech/perl-unicode-struggle.html

- Mark

p5pRT commented Jan 19, 2010

From zefram@fysh.org

Mark Blackman wrote​:

"warning​: concatenating octet and character oriented strings, double
encoding may result" at line N.

Perl doesn't distinguish between characters and octets, and retrofitting
such a distinction is pretty infeasible. So there's no prospect of such
a warning. Much like how we can't add a warning "adding apple count to
orange count".

-zefram

p5pRT commented Jan 19, 2010

From mark@exonetric.com

On 19 Jan 2010, at 21​:30, Zefram wrote​:

Mark Blackman wrote​:

"warning​: concatenating octet and character oriented strings, double
encoding may result" at line N.

Perl doesn't distinguish between characters and octets, and retrofitting
such a distinction is pretty infeasible. So there's no prospect of such
a warning. Much like how we can't add a warning "adding apple count to
orange count".

the quote from perlunicode, below,

'If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have character
semantics. This can cause surprises​: See"BUGS", below'

suggests to me that Perl does have some way of flagging these
two cases, but perhaps I'm misreading it.

- Mark

p5pRT commented Jan 19, 2010

From zefram@fysh.org

Mark Blackman wrote​:

suggests to me that Perl does have some way of flagging these
two cases, but perhaps I'm misreading it.

You're not misreading it, it's misleading you. There's a lot of such
bad terminology in the documentation, which we hope to fix.

-zefram

p5pRT commented Jan 19, 2010

From mark@exonetric.com

On 19 Jan 2010, at 21​:30, Zefram wrote​:

Mark Blackman wrote​:

"warning​: concatenating octet and character oriented strings, double
encoding may result" at line N.

Perl doesn't distinguish between characters and octets, and retrofitting
such a distinction is pretty infeasible. So there's no prospect of such
a warning. Much like how we can't add a warning "adding apple count to
orange count".

-zefram

I can still imagine a, probably optional, probably quite slow, test in the
concatenation operator that looks at both operands and if it finds

a) either operand has non-ASCII octets without a UTF8 flag, and
b) the other operand does have the UTF8 flag

delivers some kind of warning. The potential for a false positive
is still there, of course, but it might be nice to optionally
assume 'ascii' encoding during concatenation and choke on the
latin1-specific octets.

I suspect I'm in a small minority on the "niceness" of that approach.

- Mark

p5pRT commented Jan 19, 2010

From @greerga

On Tue, 19 Jan 2010, Zefram wrote​:

Mark Blackman wrote​:

"warning​: concatenating octet and character oriented strings, double
encoding may result" at line N.

Perl doesn't distinguish between characters and octets, and retrofitting
such a distinction is pretty infeasible. So there's no prospect of such
a warning. Much like how we can't add a warning "adding apple count to
orange count".

I think that's what encoding::warnings does, right?

http://search.cpan.org/~audreyt/encoding-warnings-0.11/lib/encoding/warnings.pm

--
George Greer
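
A usage sketch (editorial), assuming the CPAN module is installed: encoding::warnings reports when a byte string containing high-bit octets is implicitly upgraded during an operation such as concatenation, which is roughly the warning Mark asked for.

use strict;
use warnings;
use encoding::warnings;        # or, per its docs: use encoding::warnings 'FATAL';
use Encode qw(decode);

my $chars  = decode('UTF-8', "caf\xc3\xa9");   # character string
my $octets = "na\xc3\xafve";                   # raw UTF-8 octets
my $mixed  = $chars . $octets;                 # warns that bytes were implicitly upgraded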

p5pRT commented Jan 19, 2010

From mark@exonetric.com

On 19 Jan 2010, at 23​:39, George Greer wrote​:

On Tue, 19 Jan 2010, Zefram wrote​:

Mark Blackman wrote​:

"warning​: concatenating octet and character oriented strings, double
encoding may result" at line N.

Perl doesn't distinguish between characters and octets, and retrofitting
such a distinction is pretty infeasible. So there's no prospect of such
a warning. Much like how we can't add a warning "adding apple count to
orange count".

I think that's what encoding::warnings does, right?

http://search.cpan.org/~audreyt/encoding-warnings-0.11/lib/encoding/warnings.pm

thanks, that's what I was looking for.

- Mark

p5pRT commented Jan 20, 2010

From @ap

* Zefram <zefram@​fysh.org> [2010-01-19 22​:25]​:

Perl doesn't distinguish between characters and octets, and
retrofitting such a distinction is pretty infeasible.

Actually it’s not, but it requires cooperative code, and few
people even know that they can play along, much less that they
need to: http://search.cpan.org/perldoc?BLOB

(In my personal bikeshed, the module would have been CHARDATA and
its significance would be opposite of BLOB, because there’s only
one kind of character string, but many different kinds of binary
data, so it would be nice to allow for marking them distinctly.)

I wonder if putting facilities of this sort into core would be
a good idea, to accompany the doc fixes. Eg. what if Encode
automatically un-/marked strings with CHARDATA when en-/decoding
them? And the docs consistently promoted this feature?

Being able to tell apart character and binary data is a valid
concern. Core currently does diddly squat to cater to it. Small
wonder people latch onto the seductively named UTF8 flag. And
mistaken as they may be, they’re not all falling into the trap
en masse due to systematic stupidity. They do so because Perl is
failing them.

Regards,
--
Aristotle Pagaltzis // <http​://plasmasturm.org/>
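
As an editorial illustration of the "cooperative code" idea (a hypothetical wrapper, not the actual interface of the CPAN BLOB module): binary data can be kept in a small object that refuses to take part in ordinary string operations, so accidental mixing with character strings fails loudly instead of silently upgrading.

package My::Blob;    # hypothetical name
use strict;
use warnings;
use overload
    '.'  => sub { die "refusing to concatenate binary data with a string\n" },
    '""' => sub { die "refusing to stringify binary data implicitly\n" };

sub new    { my ($class, $octets) = @_; bless \$octets, $class }
sub octets { ${ $_[0] } }

package main;
my $blob = My::Blob->new("\x89PNG\r\n\x1a\n");
print length($blob->octets), "\n";   # explicit access to the octets is fine
# print "prefix" . $blob;            # would die instead of silently upgrading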

p5pRT commented Jan 20, 2010

From @nwc10

On Tue, Jan 19, 2010 at 10​:53​:06PM +0000, Mark Blackman wrote​:

I can still imagine a, probably optional, probably quite slow, test in the
concatenation operator that looks at both operands and if it finds

a) either operand has non-ASCII octets without a UTF8 flag, and
b) the other operand does have the UTF8 flag

delivers some kind of warning. The potential for a false positive
is still there, of course, but it might be nice to optionally
assume 'ascii' encoding during concatenation and choke on the
latin1-specific octets.

I suspect I'm in a small minority on the "niceness" of that approach.

My brain has been chewing over things like this for some weeks now.

I don't have any conclusions yet, other than "it might need 2 flags"

Nicholas Clark

p5pRT commented Mar 9, 2010

From @khwilliamson

Mark Blackman wrote​:

On 19 Jan 2010, at 21​:30, Zefram wrote​:

Mark Blackman wrote​:

"warning​: concatenating octet and character oriented strings, double
encoding may result" at line N.
Perl doesn't distinguish between characters and octets, and retrofitting
such a distinction is pretty infeasible. So there's no prospect of such
a warning. Much like how we can't add a warning "adding apple count to
orange count".

the quote from perlunicode, below,

'If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have character
semantics. This can cause surprises​: See"BUGS", below'

suggests to me that Perl does have some way of flagging these
two cases, but perhaps I'm misreading it.

- Mark

Zefram wrote​:

Mark Blackman wrote​:

suggests to me that Perl does have some way of flagging these
two cases, but perhaps I'm misreading it.

You're not misreading it, it's misleading you. There's a lot of such
bad terminology in the documentation, which we hope to fix.

-zefram

I wrote that piece of the pod, and rereading it, it still makes perfect
sense to me, so I'm wondering what the confusion is, and how it could be
said better. See below.

Mark Blackman wrote​:

On 19 Jan 2010, at 21​:30, Zefram wrote​:

Mark Blackman wrote​:

"warning​: concatenating octet and character oriented strings, double
encoding may result" at line N.
Perl doesn't distinguish between characters and octets, and retrofitting
such a distinction is pretty infeasible. So there's no prospect of such
a warning. Much like how we can't add a warning "adding apple count to
orange count".

-zefram

I can still imagine a, probably optional, probably quite slow, test in the
concatenation operator that looks at both operands and if it finds

a) either operand has non-ASCII octets without a UTF8 flag, and
b) the other operand does have the UTF8 flag

delivers some kind of warning. The potential for a false positive
is still there, of course, but it might be nice to optionally
assume 'ascii' encoding during concatenation and choke on the
latin1-specific octets.

I suspect I'm in a small minority on the "niceness" of that approach.

- Mark

What I was describing in the pod is that if a string contains octets
with the upper bit set (on ASCIIish machines), but the string doesn't
have its UTF8 flag set, those octets don't have any semantics beyond
their ordinal number and that they don't have any other properties.
That pod (and elsewhere) calls this "byte semantics."

If that string is concatenated with a string that does have the UTF8
flag set, all such octets suddenly change to having Unicode (called
"character") semantics.

I didn't want to mention the UTF8 flag detail at that point, as I
believed it to be too internal for that portion of the documentation. I
didn't come up with the terms "byte" or "character" semantics; don't
particularly like them; but I think they're pretty entrenched.

Are there suggestions for improving the wording?

And, as far as Mark's point in the last snippet, yes I think such a
warning could be done. I imagine it would be slow, so would have to be
in a separate warning category which could be turned off separately
(until the perltodo gets done to turn individual warnings off). It
could be sped up some if is_ascii_string() were recoded to do
word-by-word searching. But it would still be slow.

The problem is finding someone who wants to do this code change. My
efforts are being expended so that if one uses "use feature
'unicode_strings'", that the problem doesn't arise in the first place.

p5pRT commented Mar 9, 2010

From mark@exonetric.com

On 09/03/2010 15​:04, karl williamson wrote​:

Mark Blackman wrote​:

On 19 Jan 2010, at 21​:30, Zefram wrote​:

Mark Blackman wrote​:

"warning​: concatenating octet and character oriented strings, double
encoding may result" at line N.
Perl doesn't distinguish between characters and octets, and
retrofitting
such a distinction is pretty infeasible. So there's no prospect of such
a warning. Much like how we can't add a warning "adding apple count to
orange count".

the quote from perlunicode, below,

'If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have character
semantics. This can cause surprises​: See"BUGS", below'

suggests to me that Perl does have some way of flagging these
two cases, but perhaps I'm misreading it.

- Mark

Zefram wrote​:

Mark Blackman wrote​:

suggests to me that Perl does have some way of flagging these
two cases, but perhaps I'm misreading it.

You're not misreading it, it's misleading you. There's a lot of such
bad terminology in the documentation, which we hope to fix.

-zefram

I wrote that piece of the pod, and rereading it, it still makes perfect
sense to me, so I'm wondering what the confusion is, and how it could be
said better. See below.

Mark Blackman wrote​:

On 19 Jan 2010, at 21​:30, Zefram wrote​:

Mark Blackman wrote​:

"warning​: concatenating octet and character oriented strings, double
encoding may result" at line N.
Perl doesn't distinguish between characters and octets, and retrofitting
such a distinction is pretty infeasible. So there's no prospect of such
a warning. Much like how we can't add a warning "adding apple count to
orange count".

-zefram

The concatenation point made sense to me. I interpreted that to imply
that the UTF8 flag was a perfect proxy for "character semantics apply
here and if it's missing it's byte semantics" but I think that idea was
rejected by zefram and another poster.

In particular, this phrase by zefram

"Perl doesn't distinguish between characters and octets"

seemed to contradict my reading of the pod snippet

"If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have character
semantics"

Perhaps Zefram was trying to indicate that the semantics question only
comes into play during concatenation and that otherwise "Perl" doesn't
pay attention.

I can still imagine a, probably optional, probably quite slow, test in
the concatenation operator that looks at both operands and if it finds
a) either operand has non-ASCII octets without a UTF8 flag, and
b) the other operand does have the UTF8 flag

delivers some kind of warning. The potential for a false positive
is still there, of course, but it might be nice to optionally assume
'ascii' encoding during concatenation and choke on the
latin1-specific octets.

I suspect I'm in a small minority on the "niceness" of that approach.

- Mark

What I was describing in the pod is that if a string contains octets
with the upper bit set (on ASCIIish machines), but the string doesn't
have its UTF8 flag set, those octets don't have any semantics beyond
their ordinal number and that they don't have any other properties. That
pod (and elsewhere) calls this "byte semantics."

If that string is concatenated with a string that does have the UTF8
flag set, all such octets suddenly change to having Unicode (called
"character") semantics.

I didn't want to mention the UTF8 flag detail at that point, as I
believed it to be too internal for that portion of the documentation. I
didn't come up with the terms "byte" or "character" semantics; don't
particularly like them; but I think they're pretty entrenched.

Are there suggestions for improving the wording?

If that character/bytes distinction only applies during concatenation,
then perhaps that point should be made there, i.e. "This distinction in
semantics is only relevant in the context of concatenation".

And, as far as Mark's point in the last snippet, yes I think such a
warning could be done. I imagine it would be slow, so would have to be
in a separate warning category which could be turned off separately
(until the perltodo gets done to turn individual warnings off). It could
be sped up some if is_ascii_string() were recoded to do word-by-word
searching. But it would still be slow.

The problem is finding someone who wants to do this code change. My
efforts are being expended so that if one uses "use feature
'unicode_strings'", that the problem doesn't arise in the first place.

Another poster indicated just such a module already exists, I believe,
but I didn't get a chance to test it for my case.

http://search.cpan.org/dist/encoding-warnings/lib/encoding/warnings.pm

- Mark

p5pRT commented Mar 9, 2010

From zefram@fysh.org

Mark Blackman wrote​:

The concatenation point made sense to me. I interpreted that to imply
that the UTF8 flag was a perfect proxy for "character semantics apply
here and if it's missing it's byte semantics" but I think that idea was
rejected by zefram and another poster.

Referring to the UTF-8 flag when describing concatenation is misleading.
Referring to the UTF-8 flag as "character semantics" versus "byte
semantics" is also somewhat misleading. The underlying problem is that
you're thinking of a Perl string as a byte sequence plus this flag
(under whatever name). It's not. It's a character sequence, which
as an implementation detail can be represented in more than one way.
The overlying problem is that you're polluting readers' minds with the
faulty mental model.

String concatenation concatenates sequences of characters. That's it.

If your string logically consists of octets, rather than characters,
then from Perl's point of view you have a string of Latin-1 characters
(the first 256 codepoints of Unicode). There's no flag on the string to
say that it's logically octets rather than characters. Octets are thus
aliased to the Latin-1 characters; this is what I mean about Perl not
distinguishing between characters and octets. Of course, concatenation
of those Latin-1 characters also serves to concatenate the octets that
they're aliased to, so fortunately the one concatenation operator serves
for both uses.

Note that in describing concatenation I haven't referred to how the
character strings are represented (i.e., the UTF-8 flag). That's because
it's not relevant to the logic of concatenation.

The UTF-8 flag *does* make a difference to semantics in a few situations.
This is rightly regarded as a bug, and the range of situations affected
is shrinking with each major release. Users do need some guidance on how
to work around the bug, but I do not think it is helpful to teach them to
predict which way the flag will be set after ordinary string operations
(which is what "If strings operating under byte semantics ..." was
doing). The correct advice is for the user to apply utf8::upgrade()
or utf8::downgrade() immediately before the buggy operation. These are
best not described as "conversion" operations or similar: they affect the
internal representation but have no effect on the content of the string.
They make no difference to correctly-working operations.

-zefram
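
A minimal sketch (editorial) of the advice above: normalise the internal representation immediately before an operation known to be representation-sensitive, instead of trying to predict the flag. utf8::downgrade() can fail only if the string contains codepoints above 0xFF; the string's content is never changed.

use strict;
use warnings;

my $s = "caf\xe9";
utf8::upgrade($s);    # pretend something upgraded it earlier in the program

# The second argument means "don't die on failure"; it returns false instead.
if (utf8::downgrade($s, 1)) {
    # ... the representation-sensitive operation goes here ...
}
else {
    warn "string holds codepoints above 0xFF, cannot be stored as octets\n";
}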

p5pRT commented Mar 9, 2010

From @khwilliamson

I didn't follow much of this, so maybe it will become clear to me after
settling this snippet​:

Zefram wrote​:

Mark Blackman wrote​:

The concatenation point made sense to me. I interpreted that to imply
that the UTF8 flag was a perfect proxy for "character semantics apply
here and if it's missing it's byte semantics" but I think that idea was
rejected by zefram and another poster.

Referring to the UTF-8 flag when describing concatenation is misleading.
Referring to the UTF-8 flag as "character semantics" versus "byte
semantics" is also somewhat misleading. The underlying problem is that
you're thinking of a Perl string as a byte sequence plus this flag
(under whatever name). It's not. It's a character sequence, which
as an implementation detail can be represented in more than one way.
The overlying problem is that you're polluting readers' minds with the
faulty mental model.

String concatenation concatenates sequences of characters. That's it.

If your string logically consists of octets, rather than characters,
then from Perl's point of view you have a string of Latin-1 characters
(the first 256 codepoints of Unicode). There's no flag on the string to
say that it's logically octets rather than characters. Octets are thus
aliased to the Latin-1 characters; this is what I mean about Perl not
distinguishing between characters and octets. Of course, concatenation
of those Latin-1 characters also serves to concatenate the octets that
they're aliased to, so fortunately the one concatenation operator serves
for both uses.

If I understand correctly what you mean, I believe it to be wrong.

Octets are not aliased to the latin1 character set. Without the UTF8
flag being set, the 128 octets which don't have their upper bit set are
indeed aliased to the ASCII code set, BUT the other 128 octets are
essentially just fillers. They do not have any latin1 properties except
for their ordinal numbers. When the UTF8 flag is set, the latter 128
octets are interpreted as parts of a longer sequence of octets that
together represent the non-ASCII Unicode characters, including those in
the Latin1 Supplement, but these require two octets each.

The only ways to have those upper-bit-set 128 octets mean Latin1
currently are one of the following​:
1) Run on an EBCDIC platform where all octets have a meaning
2) Be in a "use locale" that is Latin1
3) Be in the scope of "use feature 'unicode_strings'" which in 5.12
causes (only) the case changing functions to treat these octets as Latin1.

p5pRT commented Mar 9, 2010

From @khwilliamson

karl williamson wrote​:

I didn't follow much of this, so maybe it will become clear to me after
settling this snippet​:

Zefram wrote​:

Mark Blackman wrote​:

The concatenation point made sense to me. I interpreted that to
imply that the UTF8 flag was a perfect proxy for "character
semantics apply here and if it's missing it's byte semantics" but I
think that idea was rejected by zefram and another poster.

Referring to the UTF-8 flag when describing concatenation is misleading.
Referring to the UTF-8 flag as "character semantics" versus "byte
semantics" is also somewhat misleading. The underlying problem is that
you're thinking of a Perl string as a byte sequence plus this flag
(under whatever name). It's not. It's a character sequence, which
as an implementation detail can be represented in more than one way.
The overlying problem is that you're polluting readers' minds with the
faulty mental model.

String concatenation concatenates sequences of characters. That's it.

If your string logically consists of octets, rather than characters,
then from Perl's point of view you have a string of Latin-1 characters
(the first 256 codepoints of Unicode). There's no flag on the string to
say that it's logically octets rather than characters. Octets are thus
aliased to the Latin-1 characters; this is what I mean about Perl not
distinguishing between characters and octets. Of course, concatenation
of those Latin-1 characters also serves to concatenate the octets that
they're aliased to, so fortunately the one concatenation operator serves
for both uses.

If I understand correctly what you mean, I believe it to be wrong.

Octets are not aliased to the latin1 character set. Without the UTF8
flag being set, the 128 octets which don't have their upper bit set are
indeed aliased to the ASCII code set, BUT the other 128 octets are
essentially just fillers. They do not have any latin1 properties except
for their ordinal numbers. When the UTF8 flag is set, the latter 128
octets are interpreted as parts of a longer sequence of octets that
together represent the non-ASCII Unicode characters, including those in
the Latin1 Supplement, but these require two octets each.

I realize I could have been clearer. When the UTF8 flag is set, octets
which have their upper bit set are still not interpreted as latin1.
Instead they have their utf8 meanings, which means they are each part of
some longer sequence of octets, which taken as a whole has a meaning.

The only ways to have those upper-bit-set 128 octets mean Latin1
currently are one of the following​:
1) Run on an EBCDIC platform where all octets have a meaning
2) Be in a "use locale" that is Latin1
3) Be in the scope of "use feature 'unicode_strings'" which in 5.12
causes (only) the case changing functions to treat these octets as Latin1.

p5pRT commented Mar 9, 2010

From zefram@fysh.org

karl williamson wrote​:

Octets are not aliased to the latin1 character set. Without the UTF8
flag being set, the 128 octets which don't have their upper bit set are
indeed aliased to the ASCII code set, BUT the other 128 octets are
essentially just fillers.

You're describing here the pre-Unicode character semantics, which
influence the old versions of /\w/ et al. In this situation,
the character set is effectively ASCII plus 128 codepoints that lack
character attributes.

When the UTF8 flag is set, the latter 128
octets are interpreted as parts of a longer sequence of octets that
together represent the non-ASCII Unicode characters,

Here you're making the fundamental error of confusing the octets used
internally in the string representation with the octets being represented
by the string. The octets of the representation are indeed grouped in
the manner you describe, but that's an implementation detail.

If you're representing a string of octets, they don't magically get
grouped and reinterpreted as funky characters just because you change
the representation. Indeed, that'd be a rather losing representation
if they did. (See Perl 5.6.)

If you utf8::upgrade an octet string, the string still contains the
same number of characters/octets that it used to. But now some octets
of your data take two octets internally to represent.

Octet values and the Latin-1 characters are the same entities in the sense
that this is the identity that is preserved by basic string operations.
Upgrading or downgrading a string doesn't change what it represents,
just how it is represented.

Now, semantics. In some situations, Perl will still give you the
ASCII-plus-upper-half semantics that it used to. That doesn't mean
that codepoints 0x80 to 0xff aren't Latin-1 characters, it's just
that their character attributes aren't being used in this operation.
This is a feature of the operation, not of the data. In several cases,
the operation decides which set of behaviour it's going to use based on
how the string is encoded. That's a bug.

-zefram
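
A minimal sketch (editorial) of the representation-dependent behaviour being called a bug here, as seen on the perls discussed in this thread (no locale, no unicode_strings feature): the same one-character string matches \w only after being upgraded.

use strict;
use warnings;

my $bytes = "\xe9";      # "é" stored as a single octet
my $chars = "\xe9";
utf8::upgrade($chars);   # identical content, different internal representation

print $bytes =~ /\w/ ? "word\n" : "not word\n";   # not word (byte semantics)
print $chars =~ /\w/ ? "word\n" : "not word\n";   # word (Unicode semantics)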

p5pRT commented Mar 9, 2010

From @khwilliamson

Zefram wrote​:

karl williamson wrote​:

Octets are not aliased to the latin1 character set. Without the UTF8
flag being set, the 128 octets which don't have their upper bit set are
indeed aliased to the ASCII code set, BUT the other 128 octets are
essentially just fillers.

You're describing here the pre-Unicode character semantics, which
influence the old versions of /\w/ et al. In this situation,
the character set is effectively ASCII plus 128 codepoints that lack
character attributes.

But this is not the old version of \w. The current version of \w works
this way in 5.11.5.

When the UTF8 flag is set, the latter 128
octets are interpreted as parts of a longer sequence of octets that
together represent the non-ASCII Unicode characters,

Here you're making the fundamental error of confusing the octets used
internally in the string representation with the octets being represented
by the string. The octets of the representation are indeed grouped in
the manner you describe, but that's an implementation detail.

If you're representing a string of octets, they don't magically get
grouped and reinterpreted as funky characters just because you change
the representation. Indeed, that'd be a rather losing representation
if they did. (See Perl 5.6.)

If you utf8::upgrade an octet string, the string still contains the
same number of characters/octets that it used to. But now some octets
of your data take two octets internally to represent.

Octet values and the Latin-1 characters are the same entities in the sense
that this is the identity that is preserved by basic string operations.
Upgrading or downgrading a string doesn't change what it represents,
just how it is represented.

Now, semantics. In some situations, Perl will still give you the
ASCII-plus-upper-half semantics that it used to. That doesn't mean
that codepoints 0x80 to 0xff aren't Latin-1 characters, it's just
that their character attributes aren't being used in this operation.
This is a feature of the operation, not of the data. In several cases,
the operation decides which set of behaviour it's going to use based on
how the string is encoded. That's a bug.

I really think I don't understand you, and vice versa. When you said
octet, I heard bit pattern. What I meant was that at no time (outside
locales) does, for example, the bit pattern 10000101 (= 0x85) mean NEL,
which is the latin1 interpretation of it. If you dump a scalar and
there is an octet with those bits in it, it means two different things
to operations that care about such things. If the scalar doesn't have
the UTF8 flag set, 0x85 means just one of those non-ASCII characters
with the ordinal value of 133. If the scalar does have the UTF8 flag
set, 0x85 is part of a sequence of other octets that taken together mean
some character. The sequence of the two octets C3 85, for example
means LATIN CAPITAL LETTER A WITH RING ABOVE, another Latin1 character.

Certainly when one does a utf8::upgrade, the internal string
representation changes to utf8. By doing this upgrade, the programmer
has told Perl that the scalar was intended to have Unicode semantics,
that any 0x85 in it should be treated as a NEL. But without that
upgrade, Perl assumes the old pre-Unicode character semantics for 0x85
and any other octet with its upper bit set.

Perhaps Zefram thinks that Unicode semantics is the default for Perl, or
the goal to which we are striving. It is not. Perl requires some
action to get Unicode semantics unless a scalar has a code point that
won't fit in an octet. I would have liked to make Unicode semantics the
default, but it appears that there is too much existing code that
depends on it not being so.

-zefram

p5pRT commented Mar 9, 2010

From zefram@fysh.org

karl williamson wrote​:

What I meant was that at no time (outside
locales) does, for example, the bit pattern 10000101 (= 0x85) mean NEL,
which is the latin1 interpretation of it.

The level of meaning that you're talking about is a feature of
the operations, not the data. Perl could have made octet (or
ASCII-plus-upper-half) strings a distinct data type from Unicode
character strings. It didn't​: it aliased them.

You're still focussing too much on the representation. When you say "bit
pattern" there, I suspect that you're thinking strictly of bit patterns
in the RAM behind the SvPV buffer of a scalar. (Though you're ignoring
ECC bits.) Those are not the most interesting bit patterns relating
to a string. Octets in the SvPV are not the same thing as octets in a
Perl-language string, and I've been talking more about the latter than
the former.

-zefram
