Re: [LONG] Possible utf8 implementation #562

p5pRT · 1999-09-20T02:20:48Z

Migrated from rt.perl.org#1413 (status was 'resolved')

Searchable as RT1413$

p5pRT · 1999-09-20T02:20:48Z

From The RT System itself

Perhaps. But there are quite a few of them, and they are used a lot.
Such a change is likely to make patches to 5.005_62 even more tricky
to keep in step with 5.005_0X etc. than it is now.

hv_store()/hv_fetch as they are currently implemented should be #define's
to: (intentionally extra verbose for the purpose of example)

hv_store__key_is_charp_latin1(...)
hv_fetch__key_is_charp_latin1(...)

How does this help?
How is that different from just leaving hv_store
as is and defining hv_store_key_is_charp_utf8()?

How does what this leads to, i.e.:

    if (is_utf8)
        result = hv_store_key_is_charp_utf8(hv, );
    else
        result = hv_store_key_is_charp_latin1();

improve anything at all?
IMHO it just smears out the abstraction all over the sources.
It would be better in my view to change to
result = hv_store_ent(hv, sv, ...)

Which may happen where it is easy.

Option 3 really does suck.

It was never really an option.

As much as it is a performance hit, for native
hashes, there should be zero issues with converting all keys to utf8 the
moment a utf8 string is passed in which contains a character with a
value > 256.

Yes, the new idea since I was stalled on Saturday is the per-hash
flag and the "when 1st needed" semantics.

So - that is hash keys resolved (for now).
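
To make the per-hash flag and the "when 1st needed" semantics concrete, here is a minimal sketch. It is a toy table, not Perl's HV, and every name in it (toy_table, toy_store, toy_fetch, keys_are_utf8) is hypothetical. It assumes well-formed, NUL-terminated input and uses a linear scan instead of real hashing. Keys stay Latin-1 until the first key containing a code point above 0xFF arrives; at that point all existing keys are upgraded once and the flag is flipped.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#define TOY_MAX_KEYS 16
/* Hypothetical toy table, not Perl's HV: one flag describes every key. */
typedef struct {
    char  *keys[TOY_MAX_KEYS];
    int    values[TOY_MAX_KEYS];
    size_t count;
    int    keys_are_utf8;   /* 0 = keys held as Latin-1, 1 = as UTF-8 */
} toy_table;

static char *toy_alloc(size_t n)
{
    char *p = malloc(n);
    if (!p) abort();
    return p;
}

/* Latin-1 to UTF-8: bytes >= 0x80 become two-byte sequences. */
static char *latin1_to_utf8(const char *s)
{
    char *out = toy_alloc(2 * strlen(s) + 1), *p = out;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (c >= 0x80) *p++ = (char)(0xC0 | (c >> 6));
        *p++ = (char)(c < 0x80 ? c : 0x80 | (c & 0x3F));
    }
    *p = '\0';
    return out;
}

/* UTF-8 to Latin-1; the caller guarantees no code point above 0xFF. */
static char *utf8_to_latin1(const char *s)
{
    char *out = toy_alloc(strlen(s) + 1), *p = out;
    while (*s) {
        unsigned char c = (unsigned char)*s++;
        if (c >= 0xC0 && *s)   /* lead byte 0xC2 or 0xC3 plus continuation */
            c = (unsigned char)(((c & 0x03) << 6) | ((unsigned char)*s++ & 0x3F));
        *p++ = (char)c;
    }
    *p = '\0';
    return out;
}

/* In well-formed UTF-8, a lead byte >= 0xC4 means a code point > 0xFF. */
static int utf8_is_wide(const char *s)
{
    for (; *s; s++)
        if ((unsigned char)*s >= 0xC4) return 1;
    return 0;
}

/* Fresh copy of key, converted to the table's current encoding. */
static char *toy_normalize(const toy_table *t, const char *key, int key_is_utf8)
{
    if (t->keys_are_utf8 && !key_is_utf8) return latin1_to_utf8(key);
    if (!t->keys_are_utf8 && key_is_utf8) return utf8_to_latin1(key);
    return strcpy(toy_alloc(strlen(key) + 1), key);
}

static int toy_find(const toy_table *t, const char *k)
{
    for (size_t i = 0; i < t->count; i++)
        if (strcmp(t->keys[i], k) == 0) return (int)i;
    return -1;
}

void toy_store(toy_table *t, const char *key, int key_is_utf8, int value)
{
    if (!t->keys_are_utf8 && key_is_utf8 && utf8_is_wide(key)) {
        /* First wide key: upgrade every existing key once, then flip the flag. */
        for (size_t i = 0; i < t->count; i++) {
            char *up = latin1_to_utf8(t->keys[i]);
            free(t->keys[i]);
            t->keys[i] = up;
        }
        t->keys_are_utf8 = 1;
    }
    char *k = toy_normalize(t, key, key_is_utf8);
    int i = toy_find(t, k);
    if (i >= 0) { t->values[i] = value; free(k); return; }
    assert(t->count < TOY_MAX_KEYS);
    t->keys[t->count] = k;
    t->values[t->count++] = value;
}

int *toy_fetch(toy_table *t, const char *key, int key_is_utf8)
{
    if (!t->keys_are_utf8 && key_is_utf8 && utf8_is_wide(key))
        return NULL;   /* a Latin-1 table cannot hold a wide key */
    char *k = toy_normalize(t, key, key_is_utf8);
    int i = toy_find(t, k);
    free(k);
    return i >= 0 ? &t->values[i] : NULL;
}
```

With a zeroed toy_table, storing the Latin-1 key "caf\xE9" leaves keys_are_utf8 at 0, and a fetch with the UTF-8 spelling "caf\xC3\xA9" still finds it. Storing "\xC4\x80" (U+0100) with the utf8 flag set triggers the one-time upgrade, after which both spellings of the first key keep resolving to the same entry. The cost is paid only by tables that ever see a wide key.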

Now if "you lot" could discuss the implications
of Perl_croak("Oops %s ...",charptr)

- can we have (say) %ls to say string is utf8 - is this too weird
or already "taken" for wchar_t ?
- always use %_
- another nifty idea from the list?

$@ is an SV so there is no issue with its holding the
value, what is less clear is Perl_warner() & other spots where
Perl_mess() is playing with strings.

There is a minor issue when an error message replete with Unicode hits
the STDERR IO filter that wants bytes. We cannot die printing the
error message printing the error message printing the error message ...

die("Non byte character in %_",ERRSV); // :-(

So do we go for \x{feed} style ?
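
For illustration, the \x{feed} style could look roughly like the sketch below. It is standalone C, not code from the Perl core, and escape_wide_chars is a hypothetical name. It assumes well-formed UTF-8 input and a byte-oriented output stream: code points up to 0xFF are written as a single byte, and anything wider is spelled out as \x{...}, so the message can always be printed without raising a second error.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Decode one code point from well-formed UTF-8 and advance *s past it. */
static unsigned long next_code_point(const unsigned char **s)
{
    unsigned long cp = *(*s)++;
    int extra = cp < 0x80 ? 0 : cp < 0xE0 ? 1 : cp < 0xF0 ? 2 : 3;
    if (extra)
        cp &= 0x3FUL >> extra;          /* keep only the lead byte's payload bits */
    while (extra-- > 0 && **s)
        cp = (cp << 6) | (unsigned long)(*(*s)++ & 0x3F);
    return cp;
}

/* Hypothetical helper: copy utf8 into out as bytes, escaping code points
 * above 0xFF as \x{hex}. Output is truncated at a character boundary if it
 * does not fit. Returns the number of bytes written, excluding the NUL. */
size_t escape_wide_chars(const char *utf8, char *out, size_t outsz)
{
    const unsigned char *s = (const unsigned char *)utf8;
    size_t used = 0;
    assert(outsz > 0);
    out[0] = '\0';
    while (*s) {
        char piece[16];
        unsigned long cp = next_code_point(&s);
        int n;
        if (cp <= 0xFF) {
            piece[0] = (char)cp;        /* fits in one byte: pass it through */
            piece[1] = '\0';
            n = 1;
        } else {
            n = snprintf(piece, sizeof piece, "\\x{%lx}", cp);
        }
        if (used + (size_t)n >= outsz)
            break;                      /* truncate rather than overflow */
        memcpy(out + used, piece, (size_t)n + 1);
        used += (size_t)n;
    }
    return used;
}
```

Given the UTF-8 bytes EF BB AD (U+FEED) inside a message, the output contains the eight ASCII characters \x{feed} in their place, while a UTF-8 "é" (C3 A9) comes out as the single byte E9. Whether characters in the 0x80..0xFF range should pass through or also be escaped is a policy choice this sketch does not settle.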

--
Nick Ing-Simmons

p5pRT · 1999-09-20T02:35:06Z

From The RT System itself

It all depends how badly you want full support for utf8... and how complete
you want to be about it...

hv_store()/hv_fetch as they are currently implemented should be #define's
to: (intentionally extra verbose for the purpose of example)
hv_store__key_is_charp_latin1(...)
hv_fetch__key_is_charp_latin1(...)
How does this help?
How is that different from just leaving hv_store
as is and defining hv_store_key_is_charp_utf8()?

How does what this leads to, i.e.:
    if (is_utf8)
        result = hv_store_key_is_charp_utf8(hv, );
    else
        result = hv_store_key_is_charp_latin1();
improve anything at all?

result = hv_store_key_is_sv(....);
result = hv_store_key_is_charp_maybe_utf8(..., utf8flag);

hv_store would then be discouraged.
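
A rough sketch of how that layering might look, so that callers never write the if/else themselves. This is not the Perl API: the toy_ names are hypothetical stand-ins modelled on the names proposed above, and the "store" here only records the request so the dispatch can be observed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a real store: it just remembers the last request. */
typedef struct {
    const char *key;
    size_t      klen;
    int         key_is_utf8;
} toy_store_request;

static toy_store_request toy_last;

/* Single flag-taking entry point, in the spirit of
 * hv_store_key_is_charp_maybe_utf8(..., utf8flag). */
static int toy_store_maybe_utf8(const char *key, size_t klen, int utf8flag)
{
    assert(key != NULL);
    toy_last.key = key;
    toy_last.klen = klen;
    toy_last.key_is_utf8 = (utf8flag != 0);
    return 1;
}

/* The explicit variants are thin wrappers that pin the flag. */
#define toy_store_latin1(key, klen) toy_store_maybe_utf8((key), (klen), 0)
#define toy_store_utf8(key, klen)   toy_store_maybe_utf8((key), (klen), 1)

/* The legacy name keeps old callers compiling with byte-string semantics. */
#define toy_store(key, klen)        toy_store_latin1((key), (klen))
```

An old caller writing toy_store("abc", 3) gets the Latin-1 behaviour without any source change, while new code that already knows its encoding calls toy_store_utf8 or passes the flag through toy_store_maybe_utf8.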

IMHO it just smears out the abstraction all over the sources.
It would be better in my view to change to
result = hv_store_ent(hv, sv, ...)
Which may happen where it is easy.

I wouldn't mind if hv_store/hv_store_ent took sv's, as shown above... the
problem is source code compatibility with older modules. As long as older
modules continue to work, but assume strings are raw strings, then the
majority of older modules should continue to work.

As much as it is a performance hit, for native
hashes, there should be zero issues with converting all keys to utf8 the
moment a utf8 string is passed in which contains a character with a
value > 256.
Yes, the new idea since I was stalled on Saturday is the per-hash
flag and the "when 1st needed" semantics.
So - that is hash keys resolved (for now).
Now if "you lot" could discuss the implications
of Perl_croak("Oops %s ...",charptr)
- can we have (say) %ls to say string is utf8 - is this too weird
or already "taken" for wchar_t ?
- always use %_
- another nifty idea from the list?
$@ is an SV so there is no issue with its holding the
value, what is less clear is Perl_warner() & other spots where
Perl_mess() is playing with strings.

Umm.. hmm... I don't have an opinion here until somebody brings up some more
of the issues involved here...

Including utf8 strings in your C code... :-)

The "always use %_" should satisfy the majority of the cases, but I don't
see why utf8 charp's should be "not allowed."

There is a minor issue when an error message replete with Unicode hits
the STDERR IO filter that wants bytes. We cannot die printing the
error message printing the error message printing the error message ...
die("Non byte character in %_",ERRSV); // :-(
So do we go for \x{feed} style ?

I think that would make sense. I certainly don't want
recursive error messages... :-) "Error displaying Error displaying Error ... which contains the string "..."" :-)

mark

--
markm@nortelnetworks.com/mark@mielke.cc/markm@ncf.ca __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | CUE Development (4Y21)
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | Nortel Networks
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

p5pRT · 2003-04-22T15:45:09Z

@iabyn - Status changed from 'stalled' to 'resolved'

p5pRT · 2003-04-22T15:49:56Z

@iabyn - Status changed from 'stalled' to 'resolved'

p5pRT closed this as completed Apr 22, 2003

p5pRT added Severity Low documentation labels Oct 18, 2019

p5pRT mentioned this issue Oct 19, 2019

Program terminated with signal 11, Segmentation fault. #12469

Closed
