perl's unicode conversion fails when iconv succeeds [rt.cpan.org #73623] #11833
Comments
From perl-diddler@tlinx.org: This is a bug report for perl from perl-diddler@tlinx.org. I was looking at ways to do upper/lower case compares. Rather than being faster, it choked at the beginning of my 98M test file |
From @cpansprout: On Fri Dec 30 10:41:46 2011, LAWalsh wrote:
You're right: $ piconv5.15.6 -f utf16 -t utf-8 /Users/sprout/Downloads/test.in The file begins with <FF><FE>. If I use utf-16le explicitly, it does the first line correctly, but This is part of the Encode distribution, for which CPAN is upstream, so -- Father Chrysostomos |
From @cpansprout |
The RT System itself - Status changed from 'new' to 'open' |
@cpansprout - Status changed from 'open' to 'rejected' |
From bug-Encode@rt.cpan.org <URL: https://rt.cpan.org/Ticket/Display.html?id=73623 > On Fri Dec 30 14:00:32 2011, perlbug-followup@perl.org wrote:
It sounds like it's reading line-by-line, where a line is a sequence of |
From bug-Encode@rt.cpan.org <URL: https://rt.cpan.org/Ticket/Display.html?id=73623 > Fix: - my $need2slurp = $use_bom{ find_encoding($to)->name }; |
From bug-Encode@rt.cpan.org <URL: https://rt.cpan.org/Ticket/Display.html?id=73623 > On Fri Dec 30 17:49:01 2011, ikegami wrote:
Not to be pushy or anything, but where does one apply that fix? As for the lines in the file I submitted-- they looked like they all had |
From @ikegami: On Fri, Dec 30, 2011 at 6:15 PM, Linda A Walsh via RT <
piconv |
From @ikegami: On Fri, Dec 30, 2011 at 6:15 PM, Linda A Walsh via RT <
Probably. And not really relevant. piconv was treating your file as a series of lines ending with 0A *before |
From bug-Encode@rt.cpan.org <URL: https://rt.cpan.org/Ticket/Display.html?id=73623 > On Fri Dec 30 18:44:35 2011, LAWALSH wrote:
test.out was the same size |
From bug-Encode@rt.cpan.org <URL: https://rt.cpan.org/Ticket/Display.html?id=73623 > On Fri Dec 30 18:49:46 2011, LAWALSH wrote:
Anyway, piconv doesn't do a round trip the way iconv does. Sounds like it might be assuming UTF-16 means BE and not LE? Just a WAG. |
From bug-Encode@rt.cpan.org <URL: https://rt.cpan.org/Ticket/Display.html?id=73623 > On Fri Dec 30 19:04:31 2011, LAWALSH wrote:
Yup: cmp -l -b test.in test2.out |
From @ikegami: On Fri, Dec 30, 2011 at 6:44 PM, Linda A Walsh via RT <
C<< decode('UTF-16', ...) >> both requires a BOM and removes it If you want to keep the BOM, use UTF-16le (the actual encoding) instead of This is unrelated to this ticket. - Eric |
From @ikegami: On Fri, Dec 30, 2011 at 7:01 PM, Eric Brine <ikegami@adaelis.com> wrote:
Correction/elaboration: C<< decode('UTF-16', ...) >> both requires a BOM and removes it
...and C<< encode('UTF-16', ...) >> adds it back, but uses UTF-16be instead You need C<< -to UTF-16le >> to use UTF-16le (instead of UTF-16be), but - Eric |
From bug-Encode@rt.cpan.org<URL: https://rt.cpan.org/Ticket/Display.html?id=73623 > On Fri Dec 30 19:04:31 2011, ikegami@adaelis.com wrote:
Yup: cmp -l -b test.in test2.out
How is that a correction??
Ah, then there are two rubs: 1) ...why would encode convert to BE on a LE machine? Seems like exactly 2) since piconv states that it is "designed to be a drop in replacement for |
From @ikegami: On Fri, Dec 30, 2011 at 9:15 PM, Linda A Walsh via RT <
I was correcting what *I* said. 1) ...why would encode convert to BE on a LE machine? What does Encode have to do with your machine? 2) since piconv states that it is "designed to be a drop in replacement for
Yes. Go ahead and file a bug if you want. |
From bug-Encode@rt.cpan.org <URL: https://rt.cpan.org/Ticket/Display.html?id=73623 > On Fri Dec 30 23:26:12 2011, ikegami@adaelis.com wrote:
That's where the test was run. Data is usually in the machine's native format unless you are
The original test case showed The bug was that piconv didn't work as a drop-in for iconv, as I took I tried to do the same with piconv, but piconv failed at the first step. Why the original bug report was truncated at the data point seems to be Perhaps it would be better to report that one, as this one is still not |
From zefram@fysh.org: Linda A Walsh via RT wrote:
That was the case in the 1980s. Times have changed; machines are more -zefram |
From perl-diddler@tlinx.org: Zefram via RT wrote:
This has NOT changed. It was addressed in the 1980s. If you are Networking, you used network byte order. If you are doing To do otherwise is to incur horrible inefficiencies. You can't do a string search on modern architectures USING their native Intel has string compare assembly instructions that start at the beginning In the west, we read from left to right, so to list numbers, we would On a BE machine you don't know what you will see, because the string is So a BE machine talking to another BE machine of the same word size, may You are choosing to deliberately create inefficiency for most of the With BE machines, no generation was compatible with the next, because the Data over the internet for 4-byte or 16-byte addresses is in BE order Perl doesn't represent or store strings in memory on today's machines in On top of all of the above, Piconv was supposed to be a drop in It's not cute, and it's not just quirky, it's simply harmful to anyone How is it that you would want a document in a word order that is alien Was it a particular 'screw you to Microsoft'? Who was the first major vendor |
From @ikegami: On Tue, Jan 3, 2012 at 6:09 PM, Linda Walsh <perl-diddler@tlinx.org> wrote:
Reading UTF-16le: UV c; Reading UTF-16be: UV c; I don't see anything platform-dependent or any "horrible inefficiencies". - Eric |
From perl-diddler@tlinx.org: Eric Brine wrote:
Wouldn't your target be a buffer pointer? So that, above, is really *c = *(p++) ... etc... Except that if the count is large, or greater than 4 (normal case) on if you are on a 64bit machine, then 0 SIGBUS handles/loop (done in hw on intel, but you can turn off the HW 1 load, 1 store, and 2 adds/loop *12 million loops (96meg data) *(q++)=*((unsigned int)q++) for any count >=8 vs. *(c++) = *(p++) << 8; at least 14 SIGBUS events/loop + (1 will likely line up on each side, but
You don't call a 35X slowdown, or 3300% overhead, 'horrible'? Geez.... Might want to re-examine the bad code ... Considering it has to be done for all the chars, just the 4x reduction Being able to examine code like the above is a main reason why (though the market doesn't pay for it, because they don't care about a 35x |
From @ikegami: On Wed, Jan 4, 2012 at 4:34 PM, Linda Walsh <perl-diddler@tlinx.org> wrote:
No. Perl doesn't use arrays of codepoints. Even if it did, it doesn't // UTF-16le is not anymore efficient than // UTF-16be Except that if the count is large, or greater than 4 (normal case) on LE machines, you do 4 at a time and skip the shifts(<<) and ors(|): You can't do that for the first 0..3 characters because of alignment issues. You can't do that for the last 0..3 characters because of boundary issues. You can't do that since UTF-16 is a variable width format. (You are *(q++)=*((unsigned int)q++) for any count >=8 Alignment error (not counting the missing "*"). This is the code. Note how UTF-16le ('v') is no faster than UTF-16be ('n'). static UV |
From @paulg1973At the risk of getting into a <flame> war, let me say that the posting by Ms. Walsh contains many statements that I believe are inaccurate. Anyone coming across that post in an archive in the future is advised to draw their own conclusions independent of her statements. I’d be grateful if she would cite sources. In my case, I lived through this history and can offer my personal knowledge. 1. “Big-Endian machines fell out of favor, in part, because they were structurally flawed for string operations.” This is just silly. There are plenty of examples of Big-Endian architectures that have efficient machine instructions for string operations. I personally worked on the Big-Endian Honeywell 6000 mainframes, which had a rich extended instruction set (“EIS”) that handled decimal data (up to 59 digits of precision!) and character-string operations of almost indefinite length. But perhaps you are referring to the original IBM mainframe (System/360), unarguably the most successful mainframe of its day, if not of all time. While it is true that its support for character (aka “logical”) data was fairly minimal, to say that it was structurally deficient is to use 2012 values to judge a 1965 design. It could move logical data, compare logical data, and edit decimal data into logical data; very clever stuff for its time. Big-Endian machines have in fact never fallen out of favor; there are still plenty of successful examples around. And those that did fall out of favor can’t simply blame it on Big-Endian integers; there were plenty of good reasons from the business and management side of the house. 2. An implied assumption that little-endian machines have binary representations that are easier to read (say, in a dump of storage) than big-endian machines. Again, this is silly. I spent years working on big-endian machines and had no trouble reading the dumps. I still find it lots easier to read a big-endian dump than a little-endian dump. 
In either case, you need a cheat-sheet that shows the storage layout. At least with Big-Endian formats you don’t have to byte-swap the integers. I have yet to find a human that doesn’t have to byte-swap all but the most trivial Little-Endian integer to figure out its decimal value. 3. It makes no sense to use a different endian than native, except in the network world. I work on a very successful operating system (Stratus OpenVOS) in which all user data is big-endian, yet the underlying processor is little-endian. The byte swaps add a negligible overhead, thanks to clever optimizations by Intel. Our design makes perfect business and technical sense; it made the task of porting 25 years of source code from a big-endian processor (HP PA-RISC) to a little-endian processor (x86) a LOT easier for us and for our customers. I understand that Digital Equipment Corp played a similar trick when they ported FORTRAN from their Big-Endian 36-bit machines (PDP-10, PDP-20) to their Little-Endian 16-bit/32-bit machines (PDP-11). 4. The mainframes of the 60s and 70s are extinct because they were incompatible with each other. I was there. They are extinct because they were hugely expensive, not terribly reliable, and over the years, we’ve invented much better products. They were incompatible with each other because (a) just making the darn things work was hard, (b) there was no economic incentive to make them compatible, (c) companies were reinventing the technology at a rapid clip and that required dropping the ideas that didn’t work out (the IBM System/360 was not compatible with the IBM 7094). Also, (d) networking came along pretty late in the game. If you take the years after World War II as the start of modern-day computers, it took 25 years before computer-to-computer networking came into being (the ARPAnet). Networking before the ARPAnet consisted of sneakernet (carrying punch cards, tapes, or removable disks); worked great and was plenty fast enough for the day.
5. Data over the internet is in big-endian order because it is more efficient for routing. I’d love to see your source for this comment. Again, I was there. My memory is that the established machines (of the late 60s and early 70s, when the ARPAnet was being invented) were Big-Endian. The upstarts were Little-Endian. I always thought that the designers just picked BE as the native format because that was the machine they were using at the time and were most familiar with. But I have only my memories for this, not a source. In my opinion, standardization in the computer industry is the sign that innovation has ceased. Or to put it another way, the industry eventually decides that some piece of technology is good enough and there is no good reason to try to improve it. Eventually, some sort of paradigm shift happens and major changes blow away layers of technology, but until then, we have a big incentive to use the stuff that just works. Over my time in the business (late 1960s to today), we’ve gone from having no true standards, to a couple of fairly standard programming languages (COBOL and FORTRAN), to having a fairly standard operating system (Unix/Linux) with a fairly standard programming language (C), to having fairly standard scripting languages (Perl, PHP, et al). The GNU project has been a remarkable success at standardization and has driven out a lot of proprietary technology (anybody remember the tiny compiler companies that used to exist?). On the network side, we started with ftp, graduated to bulletin boards, upgraded to the web with HTML, and now have HTML5. We still have plenty of proprietary technologies (iOS, Windows, BlackBerry OS, to name but 3), but they must constantly battle to stay ahead of the march of fairly standard, commonly-available software (e.g., Android). We even have some fairly standard application packages now (think GIMP). This trend will continue.
</flame> PG |
From perl-diddler@tlinx.org: Hi Paul, thank you for your well-written response. It would take too much work to look for all the details to support every No need to flame, IMO, but some people like it hot. Green, Paul wrote:
.... 59 digits of precision... um, at ~ 2.3 digits/char, that's about 25 how about a right shift in memory by a byte? memmove can handle it, Thank you for making my point.
Now...that's highly dependent on the type of data; if it was *string* you could have BADCFEHG, and today you'd more likely see HGFEDCBA. All While a 16-bit 8086, or a 32-bit 386, .. no prob .. the 8086 just uses On a BE machine, BUS errors were usually passed on to the program
I'm sure, everyone learns their profession!
I know nothing about it. I can only say that it would be more
They were hugely expensive because they were all different --
All of those are also excellent reasons for expense and
Actually I wouldn't. The start of the first commercial computers was
I know...initially, I thought the same thing, and I was corrected,
Innovation ceases in an area when an local optimum has been reached. It
That happens when someone finds a completely new way of doing something
Yeah... and in the particular, MS, the dominant OS on the planet and Think of a string compare... -- you just increment a pointer on
That was supposed to be a flame? Naw....flames are when you set about Oh, and in case I wasn't, your mamma wears army boots, so there. |
Migrated from rt.perl.org#107326 (status was 'rejected')
Searchable as RT107326$