XS converts to UTF-8 but not back #6001

Closed
p5pRT opened this issue Oct 10, 2002 · 5 comments

Comments

@p5pRT

p5pRT commented Oct 10, 2002

Migrated from rt.perl.org#17848 (status was 'resolved')

Searchable as RT17848$

@p5pRT

p5pRT commented Oct 10, 2002

From jacob.mandelson@overture.com

Passing strings to XS native methods converts them to UTF-8, but
strings so returned are not converted back from UTF-8.

Testcase: Trivial XS method:
  const char *nop(a)
  const char *a;
  CODE:
  RETVAL = a;
  OUTPUT:
  RETVAL

Calling this with:
  use xs;
  my $i = "\x{FF66}";
  map { printf "%04x", ord($_); } split("", $i);
  print " -> ";
  map { printf "%04x", ord($_); } split("", xs::nop($i));
  print "\n";

gives:
ff66 -> 00ef00bd00a6

Perl Info

Flags:
    category=core
    severity=high

Site configuration information for perl v5.8.0:

Configured by jlm at Mon Sep 16 17:06:29 PDT 2002.

Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.4.7-10, archname=i686-linux
    uname='linux felix 2.4.7-10 #1 thu sep 6 17:27:27 edt 2001 i686 unknown '
    config_args='-de'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-O3',
    cppflags='-fno-strict-aliasing -I/usr/include/gdbm'
    ccversion='', gccversion='3.1', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lc -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lc -lcrypt -lutil
    libc=/lib/libc-2.2.4.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.2.4'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    


@INC for perl v5.8.0:
    /home/jlm/projects/goto/lib/perllib
    /home/jlm/projects/goto/src/perlsrc
    /usr/local/BerkeleyDB.3.3/lib/site_perl/5.6.0
    /usr/local/BerkeleyDB.3.3/lib/site_perl
    /usr/local/lib/perl5/5.8.0/i686-linux
    /usr/local/lib/perl5/5.8.0
    /usr/local/lib/perl5
    /usr/local/lib/perl5/site_perl/5.8.0/i686-linux
    /usr/local/lib/perl5/site_perl/5.8.0
    /usr/local/lib/perl5/site_perl
    /home/jlm/.share/lib/perl5/site_perl/5.6.1
    /home/jlm/.share/lib/perl5/site_perl
    /usr/local/lib/perl5/5.8.0/i686-linux
    /usr/local/lib/perl5/5.8.0
    /usr/local/lib/perl5/site_perl/5.8.0/i686-linux
    /usr/local/lib/perl5/site_perl/5.8.0
    /usr/local/lib/perl5/site_perl
    .


Environment for perl v5.8.0:
    HOME=/home/jlm
    LANG=C
    LANGUAGE (unset)
    LC_ALL=C
    LD_LIBRARY_PATH=/usrlocal/lib:/lib:/usr/local/qt/lib
    LOGDIR (unset)
    PATH=/home/jlm/.bin:/usrlocal/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/sbin:/usr/sbin:/usr/local/java/bin:/usr/local/sbin:/usr/local/kde/bin:/usr/local/bin/gmt:/usr/X11R6/bin.old:/opt/qt/bin:/usrlocal/bin/Java:/usr/local/qt/bin
    PERL5LIB=/home/jlm/projects/goto/lib/perllib:/home/jlm/projects/goto/src/perlsrc:/usr/local/BerkeleyDB.3.3/lib/site_perl:/usr/local/lib/perl5:/usr/local/lib/perl5/site_perl:/home/jlm/.share/lib/perl5/site_perl
    PERL_BADLANG (unset)
    SHELL=/usr/bin/zsh

@p5pRT

p5pRT commented Oct 11, 2002

From nick.ing-simmons@elixent.com

Jacob.Mandelson@Overture.Com <perl5-porters@perl.org> writes:

> # New Ticket Created by jacob.mandelson@overture.com
> # Please include the string: [perl #17848]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org/rt2/Ticket/Display.html?id=17848 >
>
> This is a bug report for perl from jacob.mandelson@overture.com
> generated with the help of perlbug 1.34 running under perl v5.8.0.
>
> -----------------------------------------------------------------
> [Please enter your report here]
>
> Passing strings to XS native methods converts them to UTF-8, but
> strings so returned are not converted back from UTF-8.
>
> Testcase: Trivial XS method:
>   const char *nop(a)
>   const char *a;
>   CODE:
>   RETVAL = a;
>   OUTPUT:
>   RETVAL
>
> Calling this with:
>   use xs;
>   my $i = "\x{FF66}";
>   map { printf "%04x", ord($_); } split("", $i);
>   print " -> ";
>   map { printf "%04x", ord($_); } split("", xs::nop($i));
>   print "\n";
>
> gives:
> ff66 -> 00ef00bd00a6

Passing the string to XS does nothing to it at all.
Likewise the input typemap for char * tells XS nothing about what those octets
are. And then the output typemap for char * does nothing special to
mark the new SV as UTF-8.

Given the XS code I don't see how to do better - going from an SV to a char *
you lose information (not just UTF-8-ness - an embedded '\0' will
truncate the string as well).
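
The information loss from an embedded '\0' can be seen in plain C, with no Perl involved. The following is a minimal sketch: `counted_bytes` and `visible_through_char_ptr` are hypothetical names made up for this illustration and are not part of the Perl API. It only assumes that the buffer ends with a terminating NUL, as a C string literal does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for what a Perl scalar knows about its string:
   a pointer to the bytes plus an explicit length. */
struct counted_bytes {
    const char *ptr;   /* must be NUL-terminated for this demo */
    size_t len;        /* true number of bytes, embedded NULs included */
};

/* What a callee can recover when it is handed only the bare char *:
   it has to stop at the first NUL byte it meets. */
size_t visible_through_char_ptr(struct counted_bytes s)
{
    assert(s.ptr != NULL);
    return strlen(s.ptr);
}
```

For the five-byte string "ab\0cd", the scalar side knows the length is 5, but a function that receives only the pointer sees 2 bytes. The UTF-8 flag is lost in the same way: it lives next to the pointer, not in the bytes.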

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/

@p5pRT

p5pRT commented Oct 11, 2002

From jacob.mandelson@overture.com

> Passing the string to XS does nothing to it at all.
> Likewise the input typemap for char * tells XS nothing about what those octets
> are. And then the output typemap for char * does nothing special to
> mark the new SV as UTF-8.
>
> Given the XS code I don't see how to do better - going from an SV to a char *
> you lose information (not just UTF-8-ness - an embedded '\0' will
> truncate the string as well).

There is apparently a convention that the octets passed in to XS are
UTF-8: the FF66 in the Perl becomes EFBDA6 in the C. This convention
should be obeyed for the return value as well. Instead, EFBDA6 returned from C is
interpreted as ISO 8859-1 and becomes 00EF00BD00A6. (Or some other
convention should be applied both ways, but I think UTF-8 is sensible.)

With the present asymmetry, how do you propose to return strings from native
code?
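
The byte values quoted above can be checked with a short standalone C sketch. It is not Perl's implementation: the function names are made up for the illustration, and both the encoder and the UTF-8 decoder handle only code points up to U+FFFF and do no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point (U+0000..U+FFFF only) as UTF-8.
   Returns the number of bytes written to out (1 to 3). */
size_t utf8_encode_bmp(uint32_t cp, unsigned char out[3])
{
    assert(cp <= 0xFFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Read the bytes as ISO 8859-1: every byte is one character. */
size_t decode_latin1(const unsigned char *in, size_t n, uint32_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i];
    return n;
}

/* Read the bytes as UTF-8 (1- to 3-byte sequences, no validation).
   Stops at the first sequence it cannot handle. */
size_t decode_utf8_bmp(const unsigned char *in, size_t n, uint32_t *out)
{
    size_t count = 0, i = 0;
    while (i < n) {
        unsigned char b = in[i];
        if (b < 0x80) {
            out[count++] = b;
            i += 1;
        } else if ((b & 0xE0) == 0xC0 && i + 1 < n) {
            out[count++] = ((uint32_t)(b & 0x1F) << 6) | (in[i + 1] & 0x3F);
            i += 2;
        } else if ((b & 0xF0) == 0xE0 && i + 2 < n) {
            out[count++] = ((uint32_t)(b & 0x0F) << 12)
                         | ((uint32_t)(in[i + 1] & 0x3F) << 6)
                         | (in[i + 2] & 0x3F);
            i += 3;
        } else {
            break; /* malformed or outside the scope of this sketch */
        }
    }
    return count;
}
```

Encoding U+FF66 gives the three bytes EF BD A6. Reading those bytes as ISO 8859-1 gives the three characters 00EF, 00BD, 00A6 from the report, while reading them as UTF-8 gives back the single character FF66. The bytes are the same in both cases; only the interpretation differs.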

@p5pRT

p5pRT commented Nov 28, 2002

From @jhi

Like Nick Ing-Simmons, I do not see an easy way out here. The lone char * passed in to the XS simply doesn't carry enough information. The XS cannot know whether the char * contains eight-bit bytes or UTF-8 encoded data (or, if using modules like Encode, bytes of any other encoding, like UTF-16, or Latin-2, or ShiftJIS, or KOI8-R, or...).

One must pass in whole SVs: that way one can at least test for SvUTF8(sv), which would indicate that the data is in UTF-8, and one can use the APIs described in perlguts. After Perl 5.8.0 there's been a patch for perlunicode that also explains Unicode/UTF-8 for XS writers: http://public.activestate.com/cgi-bin/perlbrowse?filename=&action=patch&patch=17959&.submit=Submit&.cgifields=action
I'm marking the issue as resolved.
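
Why passing the whole scalar helps can be modelled in plain C. This is a toy sketch only: `tagged_str`, `nop_tagged` and `char_count` are hypothetical names, and the struct is a stand-in for an SV, not Perl's real data layout or API (for that, see perlguts). The `is_utf8` field plays the role of what SvUTF8(sv) reports.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a Perl scalar: bytes, length, and an encoding flag.
   Hypothetical type, for illustration only. */
struct tagged_str {
    const unsigned char *bytes;
    size_t len;
    bool is_utf8;   /* true: bytes are UTF-8; false: one byte per character */
};

/* Identity function like the ticket's nop(), except that the flag
   travels together with the data instead of being dropped. */
struct tagged_str nop_tagged(struct tagged_str s)
{
    return s;
}

/* Count characters. With the flag off every byte counts; with it on,
   UTF-8 continuation bytes (bit pattern 10xxxxxx) are skipped. */
size_t char_count(struct tagged_str s)
{
    assert(s.bytes != NULL || s.len == 0);
    size_t n = 0;
    for (size_t i = 0; i < s.len; i++) {
        if (!s.is_utf8 || (s.bytes[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

With the bytes EF BD A6, the flagged value counts as one character after the round trip and the unflagged value counts as three. A bare char * cannot distinguish the two cases, which is the asymmetry reported in this ticket.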

@p5pRT

p5pRT commented Nov 28, 2002

@jhi - Status changed from 'new' to 'resolved'
