
perl's unicode conversion fails when iconv succeeds [rt.cpan.org #73623] #11833

Closed

p5pRT opened this issue Dec 30, 2011 · 32 comments

Comments


p5pRT commented Dec 30, 2011

Migrated from rt.perl.org#107326 (status was 'rejected')

Searchable as RT107326$


p5pRT commented Dec 30, 2011

From perl-diddler@tlinx.org

This is a bug report for perl from perl-diddler@​tlinx.org,
generated with the help of perlbug 1.39 running under perl 5.12.3.


[Please describe your issue here]

Was looking at ways to do upper/lower case compare, and bumped into
piconv as being a 'drop-in replacement for "iconv"'. So I decided to try
it, thinking it would be a 'hoot' if it was faster.

Rather than faster, it choked at the beginning of my 98M test file
(i.e. I truncated the file to the first few lines, 672 bytes), which
reproduces the problem just fine .. Très sad...


p5pRT commented Dec 30, 2011

From perl-diddler@tlinx.org

test.in


p5pRT commented Dec 30, 2011

From @cpansprout

On Fri Dec 30 10​:41​:46 2011, LAWalsh wrote​:

This is a bug report for perl from perl-diddler@​tlinx.org,
generated with the help of perlbug 1.39 running under perl 5.12.3.

-----------------------------------------------------------------
[Please describe your issue here]

Was looking at ways to do upper/lower case compare, and bumped into
piconv as being a 'drop in replacement for "iconv"'. So I decided to try
it thinking it would be a 'hoot' if it was faster.

Rather than faster, it choked at the beginning of my 98M test file
(i.e. I truncated the file to the first few lines, 672 bytes), which
reproduces the problem just fine .. Très sad...

You're right:

$ piconv5.15.6 -f utf16 -t utf-8 /Users/sprout/Downloads/test.in
UTF-16​:Unrecognised BOM d at
/usr/local/lib/perl5/5.15.6/darwin-thread-multi-2level/Encode.pm line
196, <$ifh> line 2.

The file begins with <FF><FE>.

If I use utf-16le explicitly, it does the first line correctly, but
quickly switches to Chinese, which means it’s off by one byte. If I use
utf-16be explicitly, the first line is in Chinese.
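For reference, Encode's endian-neutral 'UTF-16' requires a BOM and strips it, while the explicit 'UTF-16LE'/'UTF-16BE' variants treat those two bytes as ordinary data. A minimal sketch, with byte strings invented for the example:

use Encode qw(decode);

my $bytes = "\xFF\xFEW\x00i\x00n\x00";     # "Win" in UTF-16LE, preceded by a BOM

my $via_bom = decode('UTF-16',   $bytes);  # BOM detected and removed: "Win"
my $as_le   = decode('UTF-16LE', $bytes);  # BOM kept as a character: "\x{FEFF}Win"

The "Unrecognised BOM" error quoted above is what the UTF-16 decoder raises when a chunk does not begin with FF FE or FE FF.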

This is part of the Encode distribution, for which CPAN is upstream, so
I’m forwarding this to the CPAN ticket.

--

Father Chrysostomos


p5pRT commented Dec 30, 2011

From @cpansprout

test.in


p5pRT commented Dec 30, 2011

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented Dec 30, 2011

@cpansprout - Status changed from 'open' to 'rejected'

@p5pRT p5pRT closed this as completed Dec 30, 2011

p5pRT commented Dec 30, 2011

From bug-Encode@rt.cpan.org

<URL​: https://rt.cpan.org/Ticket/Display.html?id=73623 >

On Fri Dec 30 14​:00​:32 2011, perlbug-followup@​perl.org wrote​:

On Fri Dec 30 10​:41​:46 2011, LAWalsh wrote​:

This is a bug report for perl from perl-diddler@​tlinx.org,
generated with the help of perlbug 1.39 running under perl 5.12.3.

-----------------------------------------------------------------
[Please describe your issue here]

Was looking at ways to do upper/lower case compare, and bumped into
piconv as being a 'drop in replacement for "iconv"'. So I decided
to try
it thinking it would be a 'hoot' if it was faster.

Rather than faster, it choked at the beginning of my 98M test file
(i.e. I truncated the file to the first few lines, 672 bytes), which
reproduces the problem just fine .. Très sad...

You're right:

$ piconv5.15.6 -f utf16 -t utf-8 /Users/sprout/Downloads/test.in
UTF-16​:Unrecognised BOM d at
/usr/local/lib/perl5/5.15.6/darwin-thread-multi-2level/Encode.pm line
196, <$ifh> line 2.

The file begins with <FF><FE>.

If I use utf-16le explicitly, it does the first line correctly, but
quickly switches to Chinese, which means it’s off by one byte.

It sounds like it's reading line-by-line, where a line is a sequence of
bytes ended by 0A. Of course, that's wrong for UTF-16le (and UTF-16be,
for that matter).
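A small sketch of that failure mode (hypothetical data; this is not the piconv code itself): splitting the raw bytes on 0x0A cuts the two-byte LF code unit in half, so every chunk after the first is decoded one byte out of phase.

# Two CRLF-terminated lines of UTF-16LE text, no BOM:
my $bytes = "a\x00b\x00\x0D\x00\x0A\x00c\x00d\x00\x0D\x00\x0A\x00";

# What a byte-oriented readline() effectively does: split after every 0x0A byte.
my @chunks = split /(?<=\x0A)/, $bytes;

printf "chunk %d: %d bytes\n", $_, length $chunks[$_] for 0 .. $#chunks;
# chunk 0: 7 bytes, chunk 1: 9 bytes, chunk 2: 1 byte -- every chunk after the
# first starts with the \x00 half of the previous LF, one byte out of phase.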


p5pRT commented Dec 30, 2011

From bug-Encode@rt.cpan.org

<URL​: https://rt.cpan.org/Ticket/Display.html?id=73623 >

Fix​:

- my $need2slurp = $use_bom{ find_encoding($to)->name };
+ my $need2slurp = $use_bom{ find_encoding($from)->name };
+ if ($Opt{debug}){
+ printf "Read mode​: %s\n", $need2slurp ? 'Slurp' : 'Line';
+ }
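In other words, piconv chooses between slurping the whole stream and reading it line by line, and that choice was keyed on the output encoding instead of the input encoding. A rough, paraphrased sketch of the corrected logic (the %use_bom list below is illustrative, not the actual piconv source):

use Encode qw(find_encoding decode encode);

# Encodings whose byte streams cannot safely be read as 0x0A-terminated "lines"
# (an illustrative list; the real table lives in piconv itself):
my %use_bom = map { $_ => 1 } qw(UTF-16 UTF-16BE UTF-16LE UTF-32 UTF-32BE UTF-32LE);

my ($from, $to) = ('UTF-16', 'UTF-8');
binmode STDIN;
binmode STDOUT;

# Decide the read mode from the encoding being *read*, not the one being written:
my $need2slurp = $use_bom{ find_encoding($from)->name };
$/ = undef if $need2slurp;                   # slurp instead of reading "lines"

while (defined(my $raw = <STDIN>)) {
    print encode($to, decode($from, $raw));
}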


p5pRT commented Dec 30, 2011

From bug-Encode@rt.cpan.org

<URL​: https://rt.cpan.org/Ticket/Display.html?id=73623 >

On Fri Dec 30 17​:49​:01 2011, ikegami wrote​:

Fix​:

- my $need2slurp = $use_bom{ find_encoding($to)->name };
+ my $need2slurp = $use_bom{ find_encoding($from)->name };
+ if ($Opt{debug}){
+ printf "Read mode​: %s\n", $need2slurp ? 'Slurp' : 'Line';
+ }


Not to be pushy or anything, but where does one apply that fix? I
couldn't find any need2slurp in my
/usr/lib/perl5/{5.1{0.0,2.{1,3}}.0,{site,vendor}_perl} library dirs, so
I don't know that the above lines were responsible for this particular
breakage...but then I may not be searching in the right spots...

As for the lines in the file I submitted-- they looked like they all had
CRLF as line separators...


p5pRT commented Dec 30, 2011

From @ikegami

On Fri, Dec 30, 2011 at 6​:15 PM, Linda A Walsh via RT <
bug-Encode@​rt.cpan.org> wrote​:

<URL​: https://rt.cpan.org/Ticket/Display.html?id=73623 >

On Fri Dec 30 17​:49​:01 2011, ikegami wrote​:

Fix​:

- my $need2slurp = $use_bom{ find_encoding($to)->name };
+ my $need2slurp = $use_bom{ find_encoding($from)->name };
+ if ($Opt{debug}){
+ printf "Read mode​: %s\n", $need2slurp ? 'Slurp' : 'Line';
+ }

----
Not to be pushy or anything, but where does one apply that fix? I
couldn't find any need2slurp in my
/usr/lib/perl5/{5.1{0.0,2.{1,3}}.0,{site,vendor}_perl} library dirs, so
I don't know that the above lines were responsible for this particular
breakage...but then I may not be searching in the right spots...

As for the lines in the file I submitted-- they looked like they all had
CRLF as line separators...

piconv


p5pRT commented Dec 30, 2011

From @ikegami

On Fri, Dec 30, 2011 at 6​:15 PM, Linda A Walsh via RT <
bug-Encode@​rt.cpan.org> wrote​:

As for the lines in the file I submitted-- they looked like they all had
CRLF as line separators...

Probably. And not really relevant.

piconv was treating your file as a series of lines ending with 0A *before
decoding*. LF is not 0A in UTF-16le, and an 0A is not necessarily part of a
LF in UTF-16le.
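Concretely (a one-off illustration, not code from Encode or piconv):

use Encode qw(encode);

printf "%vX\n", encode('UTF-16LE', "\n");        # A.0  -- LF is the byte pair 0A 00
printf "%vX\n", encode('UTF-16BE', "\n");        # 0.A  -- ...or 00 0A big-endian
printf "%vX\n", encode('UTF-16LE', "\x{0A0A}");  # A.A  -- 0A bytes with no LF in sight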


p5pRT commented Dec 30, 2011

From bug-Encode@rt.cpan.org

<URL​: https://rt.cpan.org/Ticket/Display.html?id=73623 >

On Fri Dec 30 17​:49​:01 2011, ikegami wrote​:

Fix​:

- my $need2slurp = $use_bom{ find_encoding($to)->name };
+ my $need2slurp = $use_bom{ find_encoding($from)->name };
+ if ($Opt{debug}){
+ printf "Read mode​: %s\n", $need2slurp ? 'Slurp' : 'Line';
+ }
=====
Partly works​:
piconv -f UTF-16 -t UTF-8 <test.in >test.out
iconv -f UTF-16 -t UTF-8 <test.in >testi.out
cmp testi.out test.out && echo ok
ok
piconv -f UTF-8 -t UTF-16 <test.out >test2.out
cmp testi.in test2.out
test.in test2.out differ​: byte 1, line 1

test.out was same size


p5pRT commented Dec 30, 2011

From bug-Encode@rt.cpan.org

<URL​: https://rt.cpan.org/Ticket/Display.html?id=73623 >

On Fri Dec 30 18​:44​:35 2011, LAWALSH wrote​:

On Fri Dec 30 17​:49​:01 2011, ikegami wrote​:

Fix​:

- my $need2slurp = $use_bom{ find_encoding($to)->name };
+ my $need2slurp = $use_bom{ find_encoding($from)->name };
+ if ($Opt{debug}){
+ printf "Read mode​: %s\n", $need2slurp ? 'Slurp' : 'Line';
+ }
=====
Partly works​:
piconv -f UTF-16 -t UTF-8 <test.in >test.out
iconv -f UTF-16 -t UTF-8 <test.in >testi.out
cmp testi.out test.out && echo ok
ok
piconv -f UTF-8 -t UTF-16 <test.out >test2.out
cmp testi.in test2.out
  ^^^^ typo.. was 'test'...

Anyway, piconv doesn't do the round trip the way iconv does.

Sounds like it might be assuming UTF-16 means BE and not LE?

Just a WAG..


p5pRT commented Dec 30, 2011

From bug-Encode@rt.cpan.org

<URL​: https://rt.cpan.org/Ticket/Display.html?id=73623 >

On Fri Dec 30 18​:49​:46 2011, LAWALSH wrote​:

On Fri Dec 30 18​:44​:35 2011, LAWALSH wrote​:

# piconv -f UTF-8 -t UTF-16 <test.out >test2.out
# cmp test.in test2.out
test.in test2.out differ​: byte 1, line 1 test.out was same size

Sounds like it might be assuming UTF-16 means BE and not LE?


Yup​:

cmp -l -b test.in test2.out
  1 377 M-^? 376 M-~
  2 376 M-~ 377 M-^?
  3 127 W 0 ^@​
  4 0 ^@​ 127 W
  5 151 i 0 ^@​
...
671 12 ^J 0 ^@​
672 0 ^@​ 134 \
cmp​: EOF on test.in


p5pRT commented Dec 31, 2011

From @ikegami

On Fri, Dec 30, 2011 at 6​:44 PM, Linda A Walsh via RT <
bug-Encode@​rt.cpan.org> wrote​:

<URL​: https://rt.cpan.org/Ticket/Display.html?id=73623 >

On Fri Dec 30 17​:49​:01 2011, ikegami wrote​:

Fix​:

- my $need2slurp = $use_bom{ find_encoding($to)->name };
+ my $need2slurp = $use_bom{ find_encoding($from)->name };
+ if ($Opt{debug}){
+ printf "Read mode​: %s\n", $need2slurp ? 'Slurp' : 'Line';
+ }
=====
Partly works​:
piconv -f UTF-16 -t UTF-8 <test.in >test.out
iconv -f UTF-16 -t UTF-8 <test.in >testi.out
cmp testi.out test.out && echo ok
ok
piconv -f UTF-8 -t UTF-16 <test.out >test2.out
cmp testi.in test2.out
test.in test2.out differ​: byte 1, line 1

C<< decode('UTF-16', ...) >> both requires a BOM and removes it
(intentionally).

If you want to keep the BOM, use UTF-16le (the actual encoding) instead of
UTF-16.

This is unrelated to this ticket.

- Eric


p5pRT commented Dec 31, 2011

From @ikegami

On Fri, Dec 30, 2011 at 7​:01 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

On Fri, Dec 30, 2011 at 6​:44 PM, Linda A Walsh via RT <
bug-Encode@​rt.cpan.org> wrote​:

<URL​: https://rt.cpan.org/Ticket/Display.html?id=73623 >

On Fri Dec 30 17​:49​:01 2011, ikegami wrote​:

Fix​:

- my $need2slurp = $use_bom{ find_encoding($to)->name };
+ my $need2slurp = $use_bom{ find_encoding($from)->name };
+ if ($Opt{debug}){
+ printf "Read mode​: %s\n", $need2slurp ? 'Slurp' : 'Line';
+ }
=====
Partly works​:
piconv -f UTF-16 -t UTF-8 <test.in >test.out
iconv -f UTF-16 -t UTF-8 <test.in >testi.out
cmp testi.out test.out && echo ok
ok
piconv -f UTF-8 -t UTF-16 <test.out >test2.out
cmp testi.in test2.out
test.in test2.out differ​: byte 1, line 1

Correction/elaboration​:

C<< decode('UTF-16', ...) >> both requires a BOM and removes it

(intentionally).

...and C<< encode('UTF-16', ...) >> adds it back, but uses UTF-16be instead
of UTF-16le.

You need C<< -to UTF-16le >> to use UTF-16le (instead of UTF-16be), but
that won't add the BOM, you need to avoid removing it in the first place by
using C<< -from UTF-16le >>.
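Put together, a short sketch of why the plain-UTF-16 round trip flips the byte order, and of one way to round-trip the original little-endian file byte for byte (example strings only):

use Encode qw(decode encode);

my $le_bytes = "\xFF\xFEW\x00";                 # the original file: UTF-16LE with a BOM

my $text = decode('UTF-16', $le_bytes);         # BOM consumed; $text eq "W"
my $back = encode('UTF-16', $text);             # "\xFE\xFF\x00W" -- BOM re-added, but big-endian

# Treating the BOM as data round-trips the LE file unchanged:
my $same = encode('UTF-16LE', decode('UTF-16LE', $le_bytes));   # "\xFF\xFEW\x00"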

- Eric


p5pRT commented Dec 31, 2011

From bug-Encode@rt.cpan.org

<URL​: https://rt.cpan.org/Ticket/Display.html?id=73623 >

On Fri Dec 30 19​:04​:31 2011, ikegami@​adaelis.com wrote​:

On Fri, Dec 30, 2011 at 7​:01 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

On Fri, Dec 30, 2011 at 6​:44 PM, Linda A Walsh via RT <
bug-Encode@​rt.cpan.org> wrote​:

<URL​: https://rt.cpan.org/Ticket/Display.html?id=73623 >

On Fri Dec 30 17​:49​:01 2011, ikegami wrote​:

Fix​:

- my $need2slurp = $use_bom{ find_encoding($to)->name };
+ my $need2slurp = $use_bom{ find_encoding($from)->name };
+ if ($Opt{debug}){
+ printf "Read mode​: %s\n", $need2slurp ? 'Slurp' : 'Line';
+ }
=====
Partly works​:
piconv -f UTF-16 -t UTF-8 <test.in >test.out
iconv -f UTF-16 -t UTF-8 <test.in >testi.out
cmp testi.out test.out && echo ok
ok
piconv -f UTF-8 -t UTF-16 <test.out >test2.out
cmp testi.in test2.out
test.in test2.out differ​: byte 1, line 1

Sounds like it might be assuming UTF-16 means BE and not LE?


Yup​: cmp -l -b test.in test2.out
1 377 M-^? 376 M-~
2 376 M-~ 377 M-^?

Correction/elaboration​:

C<< decode('UTF-16', ...) >> both requires a BOM and removes it

(intentionally).


How is that a correction??

...and C<< encode('UTF-16', ...) >> adds it back, but uses UTF-16be
instead
of UTF-16le.


Ah, then there are two rubs:

1)...why would encode convert to BE on a LE machine? Seems like exactly
the wrong decision to make.

2) since piconv states that it is "designed to be a drop in replacement for
iconv" and "iconv seems to assume LE" (maybe it only does so on LE
machines?)... then I would assert there is still a problem.


p5pRT commented Dec 31, 2011

From @ikegami

On Fri, Dec 30, 2011 at 9​:15 PM, Linda A Walsh via RT <
bug-Encode@​rt.cpan.org> wrote​:

<URL​: https://rt.cpan.org/Ticket/Display.html?id=73623 >
How is that a correction??

I was correcting what *I* said.

1)...why would encode convert to BE on a LE machine?

What does Encode have to do with your machine?

2) since piconv states that is "designed to be a drop in replacement for

iconv" and "iconv seems to assume LE", (maybe it only does so on LE
machines?)... then I would assert there is a still a problem.

Yes. Go ahead and file a bug if you want.


p5pRT commented Dec 31, 2011

From bug-Encode@rt.cpan.org

<URL​: https://rt.cpan.org/Ticket/Display.html?id=73623 >

On Fri Dec 30 23​:26​:12 2011, ikegami@​adaelis.com wrote​:

On Fri, Dec 30, 2011 at 9​:15 PM, Linda A Walsh via RT <
bug-Encode@​rt.cpan.org> wrote​:

<URL​: https://rt.cpan.org/Ticket/Display.html?id=73623 >
How is that a correction??

I was correcting what *I* said.

1)...why would encode convert to BE on a LE machine?

What does Encode have to do with your machine?


That's where the test was run.

Data is usually in the machines native format unless you are
specifically trying to export it somewhere (like over the Net, then
'network byte order' is used).

2) since piconv states that is "designed to be a drop in replacement for

iconv" and "iconv seems to assume LE", (maybe it only does so on LE
machines?)... then I would assert there is a still a problem.

Yes. Go ahead a file a bug if you want.


The original test case showed
using iconv in both directions... for some reason the perlbug SW chopped that
off .. anything after the uuencoded file I included was chopped off...
that had a whole explanation and demonstration of the bug using the
above data file (above in the original bug report, which seems to have
been corrupted by perl's bug system).

The bug was that piconv didn't work as a drop-in for iconv: with iconv I took
a simple doc, converted it to utf-8 and then back to utf-16, and the
original and the twice-converted file compared identical.

I tried to do the same with piconv, but piconv failed at the first step.

Why the original bug report was truncated at the data point seems to be
another bug in the perlbug reporting system.

Perhaps it would be better to report that one, as this one is still not
fixed: the title "perl's conversion fails when iconv succeeds" is still
true. That's why I said 'closer', but not quite there.


p5pRT commented Jan 3, 2012

From zefram@fysh.org

Linda A Walsh via RT wrote​:

Data is usually in the machines native format unless you are
specifically trying to export it somewhere

That was the case in the 1980s. Times have changed; machines are more
interconnected than they used to be.

-zefram


p5pRT commented Jan 3, 2012

From perl-diddler@tlinx.org

Zefram via RT wrote​:

Linda A Walsh via RT wrote​:

Data is usually in the machines native format unless you are
specifically trying to export it somewhere

That was the case in the 1980s. Times have changed; machines are more interconnected than they used to be.
-zefram


  This has NOT changed. It was addressed in the 1980's.

If you are networking, you use network byte order. If you are doing
processing on the same machine, you use native byte order.

To do otherwise is to incur horrible inefficiencies.

You can't do a string search on modern architectures USING their native
instruction sets, if you put data in an ALIEN format.

Intel has string compare assembly instructions that start at the beginning
of a byte string, and go from the start of the string, in low memory, (even
on BE machines -- which is one reason they fell out of favor, they were
structurally
flawed for string operations).

In the west, we read from left to right, so to list numbers, we would
put them in the order: 0 1 2 3 4 5 6 7 8 9 11 12. This is the same order that
today's computers use: low (starting) memory on the left, and you place
bytes into memory in human-readable order. If you look at memory, you would see
0 1 2 3 4 5 ... (or 30 20 31 20 33 20, where the 3x = the numbers, and
the 20 = the spaces).

On a BE machine you don't know what you will see, because the string is
different
depending on the word size used to store the string and the native word size
of the machine. When the network standard was defined, only 32-bit BE
machines were at all prevalent. So, as numbers, if stored in network byte
order (not necessarily BE order, as BE is always relative to the word
size)... i.e. if I packed them as an array at 16-bit intervals, I would see:
2 1 4 3 6 5 8 7 10 9 12 11. If I packed them into 32 bit words first,
I'd see
4 3 2 1 8 7 6 5 12 11 10 9. If I packed them into a 64bit word (we have
64-bit
machines today), we'd see
8 7 6 5 4 3 2 1 0 0 0 0 12 11 10 9. If you packed them into a 1 byte
array, and
looked at memory as bytes., you'd see the same order as you see in all
3 cases
on a LE machine. That's the advantage and likely the biggest reason why LE
machines are dominant today. It doesn't matter if you pack them in as
bytes,
words, 32-bit DWORDS, or 64-bit QWORDS, the order is the same.

So a BE machine talking to another BE machine of the same word size, may
benefit by putting them in BE order, BUT, the majority of computers used by
consumers and in the IT world, for processing are LE based. It makes
no sense
to default to a format that they can't use their native instruction set
on without
'converting'.

You are choosing to deliberately create inefficiencies for most of the
world, to follow the example of the 1960's/1970's mainframes that are now
extinct for a reason -- they didn't work well together, and each was
specialized ... Now we build things out of parts and build up, so a
subroutine getting a parameter doesn't care
if you passed it 1 byte on an 8080, or 2 bytes on an 8088, or 4 bytes on
a 586, or 8 bytes on an x86_64 machine... ALL of those subroutines would work --
unmodified, since the low byte is always 1st in memory, and that's what
they
pay attention to.

With BE machines, no generation was compatible with the next, because the
native byte order would be different at each word size.

Data over the internet for 4-byte or 16-byte addresses is in BE order
because it
makes sense for routing equipment that has to look at the high parts
first, for
routing decisions, just like you look at
(country)(province/state)(city)(street)(street addr)[subnumber]. It is
most efficient
for address parsing in network equipment to be able to look at the most
significant
parts first. But humans? and Computers doing internal work? A human can
never look at the 1st digit and make any sense of it. A computer can,
only if
it knows the length of the item coming in (which is
usually the case in a language (a sub that takes a byte, a word, or
whatever)...

Perl doesn't represent or store strings in memory on today's machines in
BE order
unless it is running on a BE machine. It is an error for conversion of
characters
to default to non-native order, as that's NEVER the _default_,
internally -- only
in the circumstance of some explicit specification would it use
non-native format.
It just doesn't make any sense.

On top of all of the above, Piconv was supposed to be a drop in
replacement for
iconv. It was meant to be a "demonstration of the unicode technology
in Perl" --
it's a BIG FAIL if it doesn't generate the same output. It cannot be
used as a drop-in replacement -- why? For the same reason why printf/sprintf
aren't parallel in perl, or the same reason why there is duplicate code in
the perl interpreter for "use" and "require", so that the case for "use"
can "special case" (quirk) functionality to "disallow" the same logical
functionality that "require" possesses.

It's not cute, and it's not just quirky, it's simply harmful to anyone
who might
want perl to evolve into something that didn't have so many odd and
non-intuitive
exceptions.

How is it that you would want a document in a word order that is alien
to your machine (when you are not known to be exporting)?
Why would you convert to a non-native format when the next most likely thing
to do would be to process that document locally?

Was it a particular 'screw you to Microsoft', who was the first major vendor
to define and use 16-bit values for unicode, and who did so in 'LE'? Even
apple went to Intel (though who knows if they will stay w/it)... but who uses
a BE machine, and to whom is the current behavior/default useful?


p5pRT commented Jan 4, 2012

From @ikegami

On Tue, Jan 3, 2012 at 6​:09 PM, Linda Walsh <perl-diddler@​tlinx.org> wrote​:

**

If you are Networking, you used network byte order. If you are doing
processing
on the same machine, you use native byte order.

To do otherwise is to incur horrible inefficiencies.

Reading UTF-16le​:

UV c;
c = *(p++);
c |= *(p++) << 8;

Reading UTF-16be​:

UV c;
c = *(p++) << 8;
c |= *(p++);

I don't see anything platform-dependent or any "horrible inefficiencies".

- Eric


p5pRT commented Jan 4, 2012

From perl-diddler@tlinx.org

Eric Brine wrote​:

On Tue, Jan 3, 2012 at 6​:09 PM, Linda Walsh <perl-diddler@​tlinx.org
<mailto​:perl-diddler@​tlinx.org>> wrote​:

If you are networking, you use network byte order. If you are
doing processing on the same machine, you use native byte order.

To do otherwise is to incur horrible inefficiencies.

Reading UTF-16le​:

UV c;
c = *(p++);
c |= *(p++) << 8;


Wouldn't your target be a buffer pointer?
I.e. because you are converting from one buffer
to another? (and I always get chided for not showing
my work...)

So that , above is really *c = *(p++) ... etc...

Except that if the count is large, or greater than 4 (normal case) on
LE machines, you do 4 at a time and skip the shifts(<<) and ors(|)​:

if you are on a 64-bit machine, then
if w=cc (1 word composed of 2 chars), d=ww, and q=dd, then to unpack your
LE string, you'd divide the length by 8, so on a 96 meg file like the one
I was using, you'd do 12 million loops of 1 store each; of course you'd
make sure it was aligned on a 64-bit boundary... thus you incur no
SIGBUS's that have to be handled in hardware that slow you down.

0 SIGBUS handles/loop (done in hw on intel, but you can turn off the HW
handling and have it take a SIGBUS for any non-aligned data, and you'll
realize
how much data is pushed around unaligned, taking at least twice as long just
for the memory accesses, not counting the SIGBUS service time (even if it is
in HW)...

1 load, 1 store, and 2 adds/loop *12 million loops (96meg data)

*(q++)=*((unsigned int)q++) for any count >=8
that's 1 assign,

vs.

*(c++) = *(p++) << 8;
*(c++) |= *(p++);
*(c++) = *(p++) << 8;
*(c++) |= *(p++);
*(c++) = *(p++) << 8;
*(c++) |= *(p++);
*(c++) = *(p++) << 8;
*(c++) |= *(p++);

at least 14 SIGBUS events/loop (1 will likely line up on each side, but
7/8 times they won't).
vs. 8 loads and stores
+ 16 adds, + 8 shifts, masks and ors. (the mask is implicit if you are
using a character data type -- because it has to be loaded into a register
from memory first and the top 24 or 56 bits of memory (32/64 bit) have to
be masked off to get you your byte.
There might be more masks depending on the types, but lets just
call it 1,
so 8 load & stores, 16 add/loop, 16 mask and 16 ORS,
The or's mask and adds are likely close to each other in speed (
with in an order of magnitude... so 48 of those.
the 8 loads & stores

Well so far we are at 8 times as many loads and stores 700% overhead
and 48 int-ops, vs. 2, or 24 times as many, (2300% overhead),

+ SIGBUS... overhead... 14/loop .. each penalizes a load /store at least
by 2x, (has to hit 2 memory positions,

so our 8 store/loads get penalized by a ***minimum*** (assuming
0 time to process the SIGBUS and just load memory) of an extra 14/loop,
so that's really 22 v 2 load-n-stores, or 11x, that's 1000% overhead...

so the 1000% + the intops 2300 -> 3300% overhead/loop or 35x slower.

I don't see anything platform-dependent or any "horrible inefficiencies".

You don't call a 35X slowdown, or 3300% overhead 'horrible'?

Geez....

Might want to re-examine the bad code ...

Considering it has to be done for all the chars, just the 4x reduction
in loop iterations, would be a bonus., let alone removal of all those
extraneous ops...

Being able to examine code like the above is a main reason why
everyone should have basic computer science education in this day and age,
though a degree is helpful...

(though the market doesn't pay for it, 'cause they don't care about a 35x
slowdown).
Consumers can just wait...


p5pRT commented Jan 4, 2012

From @ikegami

On Wed, Jan 4, 2012 at 4​:34 PM, Linda Walsh <perl-diddler@​tlinx.org> wrote​:

**

Eric Brine wrote​:

On Tue, Jan 3, 2012 at 6​:09 PM, Linda Walsh <perl-diddler@​tlinx.org>wrote​:

If you are networking, you use network byte order. If you are doing
processing on the same machine, you use native byte order.

To do otherwise is to incur horrible inefficiencies.

Reading UTF-16le​:

UV c;
c = *(p++);
c |= *(p++) << 8;

----

Wouldn't your target be a buffer pointer?

No. Perl doesn't use arrays of codepoints. Even if it did, it doesn't
change anything anyway.

// UTF-16le
UV* c = ...;
*c = *(p++);
*(c++) |= *(p++) << 8;

is not any more efficient than

// UTF-16be
UV* c = ...;
*c = *(p++) << 8;
*(c++) |= *(p++);

Except that if the count is large, or greater than 4 (normal case) on
LE machines, you do 4 at a time and skip the shifts(<<) and ors(|):

You can't do that for the first 0..3 characters because of alignment issues.

You can't do that for the last 0..3 characters because of boundary issues.

You can't do that since UTF-16 is a variable width format. (You are
incorrectly creating two characters in the destination buffer where there
is only one.)

*(q++)=*((unsigned int)q++) for any count >=8

Alignment error (not counting the missing "*").


This is the code. Note how UTF-16le ('v') is no faster than UTF-16be ('n').

static UV
enc_unpack(pTHX_ U8 **sp, U8 *e, STRLEN size, U8 endian)
{
    U8 *s = *sp;
    UV v = 0;
    if (s+size > e) {
        croak("Partial character %c",(char) endian);
    }
    switch(endian) {
    case 'N':
        v = *s++;
        v = (v << 8) | *s++;
        /* FALLTHROUGH */
    case 'n':
        v = (v << 8) | *s++;
        v = (v << 8) | *s++;
        break;
    case 'V':
    case 'v':
        v |= *s++;
        v |= (*s++ << 8);
        if (endian == 'v')
            break;
        v |= (*s++ << 16);
        v |= (*s++ << 24);
        break;
    default:
        croak("Unknown endian %c",(char) endian);
        break;
    }
    *sp = s;
    return v;
}


p5pRT commented Jan 4, 2012

From @paulg1973

At the risk of getting into a <flame> war, let me say that the posting by Ms. Walsh contains many statements that I believe are inaccurate. Anyone coming across that post in an archive in the future is advised to draw their own conclusions independent of her statements. I’d be grateful if she would cite sources. In my case, I lived through this history and can offer my personal knowledge.

1. “Big-Endian machines fell out of favor, in part, because they were structurally flawed for string operations.”

This is just silly. There are plenty of examples of Big-Endian architectures that have efficient machine instructions for string operations. I personally worked on the Big-Endian Honeywell 6000 mainframes, which had a rich extended instruction set (“EIS”) that handled decimal data (up to 59 digits of precision!) and character-string operations of almost indefinite length. But perhaps you are referring to the original IBM mainframe (System/360), unarguably the most successful mainframe of its day, if not of all time. While it is true that its support for character (aka “logical”) data was fairly minimal, to say that it was structurally deficient is to use 2012 values to judge a 1965 design. It could move logical data, compare logical data, and edit decimal data into logical data; very clever stuff for its time. Big-Endian machines have in fact never fallen out of favor; there are still plenty of successful examples around. And those that did fall out of favor can’t simply blame it on Big-Endian integers; there were plenty of good reasons from the business and management side of the house.

2. An implied assumption that little-endian machines have binary representations that are easier to read (say, in a dump of storage) than big-endian machines.

Again, this is silly. I spent years working on big-endian machines and had no trouble reading the dumps. I still find it lots easier to read a big-endian dump than a little-endian dump. In either case, you need a cheat-sheet that shows the storage layout. At least with Big-Endian formats you don’t have to byte-swap the integers. I have yet to find a human that doesn’t have to byte-swap all but the most trivial Little-Endian integer to figure out its decimal value.

3. It makes no sense to use a different endian than native, except in the network world.

I work on a very successful operating system (Stratus OpenVOS) in which all user data is big-endian, yet the underlying processor is little-endian. The byte swaps add a negligible overhead, thanks to clever optimizations by Intel. Our design makes perfect business and technical sense; it made the task of porting 25 years of source code from a big-endian processor (HP PA-RISC) to a little-endian processor (x86) a LOT easier for us and for our customers. I understand that Digital Equipment Corp played a similar trick when they ported FORTRAN from their Big-Endian 36-bit machines (PDP-10, PDP-20) to their Little-Endian 16-bit/32-bit machines (PDP-11).

4. The mainframes of the 60s and 70s are extinct because they were incompatible with each other.

I was there. They are extinct because they were hugely expensive, not terribly reliable, and over the years, we’ve invented much better products. They were incompatible with each other because (a) just making the darn things work was hard, (b) there was no economic incentive to make them compatible, (c) companies were reinventing the technology at a rapid clip and that required dropping the ideas that didn’t work out (the IBM System/360 was not compatible with the IBM 7094). Also, (d) networking came along pretty late in the game. If you take the years after World War II as the start of modern-day computers, it took 25 years before computer-to-computer networking came into being (the ARPAnet). Networking before the ARPAnet consisted of sneakernet (carrying punch cards, tapes, or removable disks); worked great and was plenty fast enough for the day.

5. Data over the internet is in big-endian order because it is more efficient for routing.

I’d love to see your source for this comment. Again, I was there. My memory is that the established machines (of the late 60s and early 70s, when the ARPAnet was being invented) were Big-Endian. The upstarts were Little-Endian. I always thought that the designers just picked BE as the native format because that was the machine they were using at the time and were most familiar with. But I have only my memories for this, not a source.

In my opinion, standardization in the computer industry is the sign that innovation has ceased. Or to put it another way, the industry eventually decides that some piece of technology is good enough and there is no good reason to try to improve it. Eventually, some sort of paradigm shift happens and major changes blow away layers of technology, but until then, we have a big incentive to use the stuff that just works. Over my time in the business (late 1960s to today), we’ve gone from having no true standards, to a couple of fairly standard programming languages (COBOL and FORTRAN), to having a fairly standard operating system (Unix/Linux) with a fairly standard programming language (C), to having fairly standard scripting languages (Perl, PHP, et al). The GNU project has been a remarkable success at standardization and has driven out a lot of proprietary technology (anybody remember the tiny compiler companies that used to exist?). On the network side, we started with ftp, graduated to bulletin boards, upgraded to the web with HTML, and now have HTML5. We still have plenty of proprietary technologies (iOS, Windows, BlackBerry OS, to name but 3), but they must constantly battle to stay ahead of the march of fairly standard, commonly-available software (e.g., Android). We even have some fairly standard application packages now (think GIMP). This trend will continue.

</flame>

PG



p5pRT commented Jan 6, 2012

From perl-diddler@tlinx.org

Hi Paul, thank you for your well-written response.

It would take too much work to look for all the details to support every
sentence I said, but I will address some of the specifics.

No need to flame, IMO, but some people like it hot.

Green, Paul wrote​:

At the risk of getting into a <flame> war, let me say that the posting
by Ms. Walsh contains many statements that I believe are inaccurate.
Anyone coming across that post in an archive in the future is advised
to draw their own conclusions independent of her statements. I’d be
grateful if she would cite sources. In my case, I lived through this
history and can offer my personal knowledge.

1. “Big-Endian machines fell out of favor, in part, because they
were structurally flawed for string operations.”


.... 59 digits of precision... um, at ~2.3 digits/char, that's about 25
whole chars of string length! WOW!.. and how many books will fit in that????

  How about a right shift in memory by a byte? memmove can handle it,
and how well can a BE machine handle that? Each 32 or 16 bit word ..
or 60 bit word
has to be unpacked, shifted internally, and propagated to the next
word. It's a nightmare.

Thank you for making my point.

2. An implied assumption that little-endian machines have binary
representations that are easier to read (say, in a dump of storage)
than big-endian machines.

Now...that's highly dependent on the type of data, if it was *string*
data, which is
what we are talking about, organizing your dump to print from low on the
left to high on the right, you see things in alphabetical order. On a
BE machine, you'll not see a natural ordering because the strings will
go DCBAHGFE instead of ABCDEFGH, If I'm looking for strings in
non-reversed mode, I find abcdefgh ordering much easier to read that a
byte-swapped order...of course with BE machines, you had multiple types.

you could have BADCFEHG, and today you'd more likely see HGFEDCBA. All
of them played so nice together.

With a 16-bit 8086, or a 32-bit 386, .. no prob .. the 8086 just uses
2 words, low+high aligned, and the 386 used 1 DWORD, no switching... they
are inherently compatible with each other. If one side thinks it is
passing an int, and the other side only returns 8 significant bits
(like exit), no problem, -10 is -10 -- you pass -10 in a 32-bit
register to a process expecting an 8-bit int, no problem! They are
automatically compatible, but on a BE machine everything must
match. You can't just look at the low bits and expect sanity. Sure,
everything 'should be perfect anyway', well, we know how well that
expectation works.

On a BE machine, BUS errors were usually passed on to the program
because you couldn't expect misaligned data to be read correctly,
because reading an int from bytes 0-3 was very different than reading
an int that was stored at a horrid offset 1-4 in a struct. It may run
slower, but it has the flexibility to still run; on a BE machine, that's
not an option.

Again, this is silly. I spent years working on big-endian machines and
had no trouble reading the dumps.


  I'm sure, everyone learns their profession!

3. It makes no sense to use a different endian than native,
except in the network world.

I work on a very successful operating system (Stratus OpenVOS) in
which all user data is big-endian, yet the underlying processor is
little-endian. The byte swaps add a negligible overhead, thanks to
clever optimizations by Intel. Our design makes perfect business and
technical sense; it made the task of porting 25 years of source code
from a big-endian processor (HP PA-RISC) to a little-endian processor
(x86) a LOT easier for us and for our customers. I understand that
Digital Equipment Corp played a similar trick when they ported FORTRAN
from their Big-Endian 36-bit machines (PDP-10, PDP-20) to their
Little-Endian 16-bit/32-bit machines (PDP-11).

I know nothing about it. I can only say that it would be more
efficient if it didn't have to byte-switch, ***BUT*** given the work
required to convert it in software, you might never recoup the money
spent changing the software to be native endian.

4. The mainframes of the 60s and 70s are extinct because they
were incompatible with each other.

I was there. They are extinct because they were hugely expensive, not
terribly reliable, and over the years, we’ve invented much better
products.


  They were hugely expensive because they were all different --
i.e. not interchangeable, -- not commodity parts, couldn't substitute
one for the other -- i.e. they were all incompatible. As PC's became
dominant, it forced the incompatible makers into smaller and smaller
niche markets. Even Apple finally folded and went with Intel. You
are agreeing with me! But you put out processors from Intel,
AMD, and a few others, and the binaries _can_ be portable between
machines. You'd never expect that between BE machines... there were too
many ways to be different. Too many ways to align that word, but with
LE, there's only 1 way, so already, you've narrowed the compatibility
tremendously.

They were incompatible with each other because (a) just making the
darn things work was hard, (b) there was no economic incentive to make
them compatible, (c) companies were reinventing the technology at a
rapid clip and that required dropping the ideas that didn’t work out
(the IBM System/360 was not compatible with the IBM 7094). Also, (d)
networking came along pretty late in the game.

All of those are also excellent reasons for expense and
incompatibility. It's not just
one reason, I would agree, and sorry if you feel I said that was the
only reason.

If you take the years after World War II as the start of modern-day
computers

Actually I wouldn't. The start of the first commercial computers was
around the early 50's, but I would consider the integrated circuit to be
the start of modern computing, since everything has 'sorta' been a
shrink of that tech... and that was around 58-59. It took 6 years from
there for ARPA to fund its first network, and ARPANET was first used in
69...

5. Data over the internet is in big-endian order because it is
more efficient for routing.

I’d love to see your source for this comment. Again, I was there. My
memory is that the established machines (of the late 60s and early
70s, when the ARPAnet was being invented) were Big-Endian. The
upstarts were Little-Endian. I always thought that the designers just
picked BE as the native format because that was the machine they were
using at the same and were most familiar with. But I have only my
memories for this, not a source.


  I know...initially, I thought the same thing, and I was corrected,
by remembering
my routing lessons -- especially when reading the routing about IPV6!
It may not have been much of a factor, but with longer addressing,
except for those interested in auditing, you'll find fewer ipv6 routers
needing to look at the full address in order to do their function. I
can't imagine the same wasn't true on a smaller scale with smaller
machines, but I didn't want to blame it all on the fact that larger
machines were just more in vogue then! I was trying to give a little
bit of credit to the design...??? But if what you say is true, that's
only another reason to NOT use BE order -- and choosing to default to BE
order is a sign of anachronistic programming from the 70's! How that
could find its way into perl, which didn't exist back then, is beyond me!

In my opinion, standardization in the computer industry is the sign
that innovation has ceased. Or to put it another way, the industry
eventually decides that some piece of technology is good enough and
there is no good reason to try to improve it.

Innovation ceases in an area when a local optimum has been reached. It
becomes increasingly difficult to innovate when you are at or near a
local optimum.

Eventually, some sort of paradigm shift happens and major changes blow
away layers of technology,

That happens when someone finds a completely new way of doing something
that creates (usually) a non-local optimum that is better than the
current. Thus, the usually accompanying major upheaval. A whole
chapter or three goes into this in "Artificial Intelligence, A Modern
approach" (Russell and Norvig), it's also a standard feature in game
theory/design.

but until then, we have a big incentive to use the stuff that just works.


  Yeah... and in particular, MS, the dominant OS on the planet, and
Intel, the dominant architecture, use LE, so why would someone put
something in BE -- an ordering that has all but died out along with
the architectures that used it. Why would anyone put something in a
dead architecture's language by default, on a LE machine?

  Think of a string compare... -- you just increment a pointer on
both...but on a BE machine, you have to unpack it to get the order
right. Same with UTF16BE... guaranteed to have to unpack it to get the
order right, but if it is UTF16LE, then as long as you don't run out of
the first 64K, you can just use an incremental 16-bit compare. Code
plane usage isn't that frequent with western chars (unfortunately, or MS
would have them working better in Win7! a giant leap backward from XP for
font support -- unbelievable!)....

</flame>

That was supposed to be a flame? Naw....flames are when you set about
to burn the other person...I didn't get that impression, disagreeing?
Common. But you weren't sufficiently rude, obnoxious or
berating....you'll really have to work on that... ;-)

Oh, and in case I wasn't, your mamma wears army boots, so there.
(gettin' into serious flamage here!)
-linda
