Re: utf8 implementation: shared keys #541

p5pRT · 1999-09-20T01:45:04Z

Migrated from rt.perl.org#1388 (status was 'resolved')

Searchable as RT1388$

From The RT System itself

That's fair.

But ... why do we need to dissociate utf8 and byte in this way?
So long as we can tell they are not "equal" we may as well share the bytes.
So we just have a flag which sets the utf8-ness of this shared value.
We hash them the same, store them the same, but two entries only match
if the utf8-ness matches as well as the hash, length, and bytes.
Now I can have one clump-of-bytes
Because I don't believe that these strings will coincide _exactly_.
So why have multiple sets of overhead and time consuming decision process?
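
To make the "one table, one flag" idea concrete, here is a minimal sketch in Python rather than perl's C internals. The names `SharedKeys` and `share` are hypothetical and are not perl's actual shared string table API. It assumes the hash is computed over the raw bytes only, so byte and utf8 entries with identical bytes land in the same bucket and are told apart solely by the flag.

```python
class SharedKeys:
    """Hypothetical shared key table; not perl's real implementation."""

    def __init__(self):
        # hash of the raw bytes -> list of entries in that bucket
        self.buckets = {}

    def share(self, data: bytes, is_utf8: bool):
        """Return the shared entry for (data, is_utf8), creating it if needed."""
        h = hash(data)  # the utf8 flag takes no part in hashing
        bucket = self.buckets.setdefault(h, [])
        for entry in bucket:
            # A match needs the flag as well as the bytes (and so the length).
            if entry["utf8"] == is_utf8 and entry["bytes"] == data:
                entry["refs"] += 1
                return entry
        entry = {"bytes": data, "utf8": is_utf8, "refs": 1}
        bucket.append(entry)
        return entry
```

With this layout a byte string and a utf8 string made of the same bytes share a bucket and a hash value but never compare equal.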

To create tighter hashes, and hopefully those "time consuming decisions"
would only be made once for every string either offered as input from
some source that would be considered tainted (i.e. external data), or
once at compile time for strings specified at the source level.

In fact, as I mentioned above, I believe that the datatab string table
is likely not worth storing at all.
I know I need it - it allows the app. to run without swapping...

This is fair, although it might be worth allowing apps which use
binary data for some other purpose to disable shared keys for that
specific hash.

What we need is a better algorithm that will be able to eliminate
hash keys with only one reference after some unspecified period.
Shared keys are only useful if the key is actually shared.
Tighter hashes have the potential to be faster for the simple reason that
they take up fewer pages of memory.
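
One way such an algorithm could look, purely as an illustration: count references per key and periodically sweep out keys that are still unshared after a grace period. The class name `SharedKeyTable`, the tick-based clock, and the `grace_ticks` parameter are all assumptions made for this sketch; the message above leaves the period "unspecified".

```python
class SharedKeyTable:
    """Hypothetical table that evicts keys nobody else ended up sharing."""

    def __init__(self, grace_ticks=2):
        self.entries = {}            # key bytes -> [refcount, tick when stored]
        self.tick = 0
        self.grace_ticks = grace_ticks

    def acquire(self, key: bytes):
        entry = self.entries.get(key)
        if entry is None:
            self.entries[key] = [1, self.tick]
        else:
            entry[0] += 1

    def sweep(self):
        """Advance the clock and drop keys that are old and still unshared."""
        self.tick += 1
        stale = [k for k, (refs, born) in self.entries.items()
                 if refs == 1 and self.tick - born >= self.grace_ticks]
        for k in stale:
            # The single owner would keep a private copy of the key instead.
            del self.entries[k]
        return stale
```

Keys with two or more references survive every sweep, so only the genuinely unshared ones stop costing table space.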

Given an analysis of the string contents,
Um, that sounds slow.
Not if we keep Larry Wall's suggested 7-bit clean optimization in place.
But that means finding another flag bit. Which may mean making TYPEMASK
0x7F rather than 0xFF and converting a byte read in to a read +
and-with-mask.
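
The cost being described can be sketched as follows. This is an illustration only: the single-byte layout and the name `CLEAN_FLAG` are assumptions, and using the freed bit for "7-bit clean" is just one possible use of it. `TYPEMASK` is taken from the text with the narrowed value 0x7F.

```python
TYPEMASK = 0x7F      # low seven bits hold the type (narrowed from 0xFF)
CLEAN_FLAG = 0x80    # hypothetical flag stored in the freed eighth bit

def pack(type_id: int, clean: bool) -> int:
    """Combine a type number and the flag into one byte."""
    if not 0 <= type_id <= TYPEMASK:
        raise ValueError("type does not fit in seven bits")
    return type_id | (CLEAN_FLAG if clean else 0)

def get_type(b: int) -> int:
    # Previously a plain byte read; now a read plus an and-with-mask.
    return b & TYPEMASK

def is_clean(b: int) -> bool:
    return bool(b & CLEAN_FLAG)
```

Every reader of the type now pays one extra AND, which is the trade-off raised above.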

Well, you could always OR all the bytes in sequence and check the 8th
bit when you are done.
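
A minimal sketch of that check, written in Python for brevity (the function name `is_7bit_clean` is made up for illustration):

```python
def is_7bit_clean(data: bytes) -> bool:
    """OR every byte together, then test the eighth bit once at the end."""
    acc = 0
    for b in data:
        acc |= b          # no branch inside the loop
    return (acc & 0x80) == 0
```

The loop has no per-byte branch, so the scan is a single linear pass over the string.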

(Crossing my fingers... :-) ) The majority of cases should be able to be
determined at compile time.
That is far from clear.

The only part unclear is how much effort would be required in adding
this optimization into perl. It is for sure possible.

And assuming that all operations that combine
strings, or operate on them, propagate the bits whenever it is accurate
to do so, the slowdown should be fairly minimal. (And hopefully, absolutely
minimal for "normal" perl...)
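
As a rough sketch of what propagating the bit could mean, assuming each string carries a "7-bit clean" flag next to its bytes (the `Str` type and the helper names are hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Str:
    data: bytes
    clean: bool   # True only when every byte is known to be below 0x80

def scan(data: bytes) -> Str:
    """Full analysis, needed once for strings of unknown origin."""
    return Str(data, all(b < 0x80 for b in data))

def concat(a: Str, b: Str) -> Str:
    # The result is clean exactly when both inputs are, so no rescan is needed.
    return Str(a.data + b.data, a.clean and b.clean)

def substr(s: Str, start: int, end: int) -> Str:
    piece = s.data[start:end]
    if s.clean:
        return Str(piece, True)   # any slice of a clean string is clean
    return scan(piece)            # flag cannot be carried over accurately
```

Concatenation and slices of clean strings cost one boolean operation; only a slice of a non-clean string falls back to the full scan.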
For instance, if there are ZERO uses of utf8 code anywhere so far, there
is no need at all to perform the extra analysis to determine if the
string is 7-bit clean. Only on the first occurrence of "use utf8;" or
"\x{...}" or some other set of specifications, does this analysis need
to begin to take place.
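
A sketch of that lazy switch, under the assumption of a single interpreter-wide flag (the `Interp` class, `note_utf8_use`, and the `None` marker for "never analysed" are all invented for this illustration):

```python
class Interp:
    """Hypothetical interpreter state; analysis stays off until utf8 appears."""

    def __init__(self):
        self.utf8_seen = False   # flips on the first "use utf8;" or "\x{...}"
        self.scans = 0           # how many strings were actually analysed

    def note_utf8_use(self):
        self.utf8_seen = True

    def new_string(self, data: bytes):
        if not self.utf8_seen:
            return (data, None)  # None: cleanliness was never computed
        self.scans += 1
        return (data, all(b < 0x80 for b in data))
```

Strings created before the switch carry no answer at all, so a real design would still have to decide how to treat them once utf8 shows up.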
Having a huge hash with many keys that are only referenced once is slow.

I really think you have to go all the way, or else you'll regret it later.

Of course, perhaps smaller steps are possible.

mark

--
markm@nortelnetworks.com/mark@mielke.cc/markm@ncf.ca __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | CUE Development (4Y21)
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | Nortel Networks
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

p5pRT · 2003-04-22T16:19:42Z

@iabyn - Status changed from 'stalled' to 'resolved'

p5pRT closed this as completed Apr 22, 2003

p5pRT added Severity Low documentation labels Oct 18, 2019

This was referenced Oct 18, 2019

Mail Error #7662

Closed

Program terminated with signal 11, Segmentation fault. #12469

Closed
