Wide character in subroutine entry, DB_File #15083

Closed
p5pRT opened this issue Dec 8, 2015 · 11 comments

Comments

p5pRT commented Dec 8, 2015

Migrated from rt.perl.org#126849 (status was 'resolved')

Searchable as RT126849$

p5pRT commented Dec 8, 2015

From frederik@ofb.net

Created by frederik@ofb.net

The following program produces the error "Wide character in subroutine
entry at ./bug-example line 23.". I guess it means that DB_File does
not support UTF-8. I notice that when using BerkeleyDB, it works. I
had some trouble debugging this and wanted to suggest some
improvements:

1. perldiag mentions "Wide character in %s" but not "Wide character in
subroutine entry". The description for the former talks about
filehandles and binmode, while "Wide character in subroutine entry"
seems to demand a use of encode(...). Perhaps the "subroutine entry"
version of the message should be described specially or separately in
perldiag.

2. I guess DB_File is a bit old, but I chose it because I don't need
any of the BerkeleyDB features like cursors, and I value backwards
compatibility. Perhaps the man page should mention that it doesn't
work with UTF-8, which would have changed my decision. Or the man page
could even mention that one needs to encode("utf-8", $_) on keys.

3. Then again, DB_File could be updated to support UTF-8.
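
A minimal sketch of the manual encoding mentioned in point 2, assuming DB_File is installed. The temporary file name and the sample keys are made up for illustration; `encode` and `decode` come from the Encode module already used in the report.

```perl
use strict;
use warnings;
use utf8;
use Fcntl;                       # O_CREAT, O_RDWR
use DB_File;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

# Hypothetical scratch database, removed when the script exits.
my $dbf = tempdir(CLEANUP => 1) . "/manual.db";
tie my %h, 'DB_File', $dbf, O_CREAT|O_RDWR, 0666, $DB_BTREE
    or die "Cannot open $dbf: $!";

my @ents = ('œ', 'naïve', 'plain');

# Encode each key to UTF-8 octets before it reaches DB_File.
$h{ encode('UTF-8', $_) } = 1 for @ents;

# Keys come back as octets, so decode them to get characters again.
my @keys = sort map { decode('UTF-8', $_) } keys %h;

untie %h;

binmode STDOUT, ':encoding(UTF-8)';
print join("\n", @keys), "\n";
```

The drawback of this approach is that every read and write has to remember the encode or decode step.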

Thanks so much for a great programming language.

  #!/bin/perl

  use strict;
  use utf8;
  use BerkeleyDB;
  use DB_File;
  use Encode;

  $\ = "\n";

  my $dbf = "xx.db";
  unlink $dbf;

  my %h;

  # tie %h, "BerkeleyDB::Btree", -Filename=>$dbf, -Flags=>DB_CREATE;
  tie %h, 'DB_File', $dbf, O_CREAT|O_RDWR, 0666, $DB_BTREE;

  my @ents;
  # @ents = map {decode("utf-8", $_)} @ARGV;
  @ents = decode("utf-8", encode("utf-8",'œ'));

  for(@ents) { $h{$_} = 1; }

  print join("\n", keys %h);

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.22.0:

Configured by builduser at Tue Jun  2 09:45:08 CEST 2015.

Summary of my perl5 (revision 5 version 22 subversion 0) configuration:
   
  Platform:
    osname=linux, osvers=4.0.4-2-arch, archname=x86_64-linux-thread-multi
    uname='linux flo-64 4.0.4-2-arch #1 smp preempt fri may 22 03:05:23 utc 2015 x86_64 gnulinux '
    config_args='-des -Dusethreads -Duseshrplib -Doptimize=-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong --param=ssp-buffer-size=4 -Dprefix=/usr -Dvendorprefix=/usr -Dprivlib=/usr/share/perl5/core_perl -Darchlib=/usr/lib/perl5/core_perl -Dsitelib=/usr/share/perl5/site_perl -Dsitearch=/usr/lib/perl5/site_perl -Dvendorlib=/usr/share/perl5/vendor_perl -Dvendorarch=/usr/lib/perl5/vendor_perl -Dscriptdir=/usr/bin/core_perl -Dsitescript=/usr/bin/site_perl -Dvendorscript=/usr/bin/vendor_perl -Dinc_version_list=none -Dman1ext=1perl -Dman3ext=3perl -Dcccdlflags='-fPIC' -Dlddlflags=-shared -Wl,-O1,--sort-common,--as-needed,-z,relro -Dldflags=-Wl,-O1,--sort-common,--as-needed,-z,relro'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong --param=ssp-buffer-size=4',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion='', gccversion='5.1.0', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678, doublekind=3
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16, longdblkind=3
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-Wl,-O1,--sort-common,--as-needed,-z,relro -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/x86_64-unknown-linux-gnu/5.1.0/include-fixed /usr/lib /lib/../lib /usr/lib/../lib /lib /lib64 /usr/lib64
    libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.21.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.21'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/core_perl/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -Wl,-O1,--sort-common,--as-needed,-z,relro -L/usr/local/lib -fstack-protector-strong'



@INC for perl 5.22.0:
    /home/frederik/scripts-misc/perl
    /home/frederik/.local/lib/perl5/x86_64-linux-thread-multi
    /home/frederik/.local/lib/perl5
    /usr/lib/perl5/site_perl
    /usr/share/perl5/site_perl
    /usr/lib/perl5/vendor_perl
    /usr/share/perl5/vendor_perl
    /usr/lib/perl5/core_perl
    /usr/share/perl5/core_perl
    .


Environment for perl 5.22.0:
    HOME=/home/frederik
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/home/frederik/.local/arch/x86_64/lib:/home/frederik/.local/lib:/usr/local/lib
    LOGDIR (unset)
    PATH=/home/frederik/.local/bin:/home/frederik/projects/mailproc:/home/frederik/scripts-misc:/home/frederik/.local/arch/x86_64/bin:/usr/bin/core_perl:/usr/bin/vendor_perl:/usr/bin/site_perl:/usr/local/bin:/usr/local/sbin:/usr/bin
    PERL5LIB=/home/frederik/scripts-misc/perl:/home/frederik/.local/lib/perl5
    PERL_BADLANG (unset)
    PERL_LOCAL_LIB_ROOT=/home/frederik/.local/:/home/frederik/.local/:/home/frederik/.local/:/home/frederik/.local/
    PERL_MB_OPT=--install_base "/home/frederik/.local/"
    PERL_MM_OPT=INSTALL_BASE=/home/frederik/.local/
    SHELL=/bin/zsh

p5pRT commented Dec 9, 2015

From @tonycoz

On Tue Dec 08 14:33:46 2015, frederik@ofb.net wrote:

> The following program produces the error "Wide character in subroutine
> entry at ./bug-example line 23.". I guess it means that DB_File does
> not support UTF-8. I notice that when using BerkeleyDB, it works. I
> had some trouble debugging this and wanted to suggest some
> improvements:

BerkeleyDB simply isn't warning about the lack of UTF-8 support.

If I add the following to the end of your code:

  my @keys = keys %h;
  print $keys[0] eq $ents[0] ? "match" : "no match";

and uncomment the BerkeleyDB tie, you'll see that the key you supplied
doesn't match the key that the database is storing.
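
The mismatch can be illustrated without any database at all. This sketch assumes that what comes back from the unfiltered BerkeleyDB tie is the UTF-8 octet form of the key; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

my $supplied = 'œ';                          # one character, U+0153
my $stored   = encode('UTF-8', $supplied);   # two octets, 0xC5 0x93

# Comparing characters with octets fails, just like the check above.
my $match = $stored eq $supplied ? "match" : "no match";

# Decoding the octets restores the original character string.
my $round_trip = decode('UTF-8', $stored) eq $supplied ? "match" : "no match";

printf "%s (%d character vs %d octets); after decode: %s\n",
    $match, length($supplied), length($stored), $round_trip;
```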

Luckily both BerkeleyDB and DB_File have a mechanism to automatically process
both keys and values. For DB_File:

  use DBM_Filter;
  my $db = tie %h, 'DB_File', $dbf, O_CREAT|O_RDWR, 0666, $DB_BTREE;
  $db->Filter_Key_Push('utf8');

For BerkeleyDB:

  my $db = tie %h, "BerkeleyDB::Btree", -Filename=>$dbf, -Flags=>DB_CREATE;
  $db->filter_store_key(sub { utf8::encode($_) });
  $db->filter_fetch_key(sub { utf8::decode($_) });

Here I'm only processing the keys; see the documentation on processing the values instead (or as well).

(perldoc DBM_Filter claims to support BerkeleyDB, but doesn't appear to.)
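
For reference, the DB_File snippet above can be expanded into a self-contained script along these lines. It is only a sketch: it assumes DB_File is installed, the scratch file name and sample data are invented, and it uses `Filter_Push` (which DBM_Filter documents as applying to both keys and values) rather than `Filter_Key_Push`.

```perl
use strict;
use warnings;
use utf8;
use Fcntl;                       # O_CREAT, O_RDWR
use DB_File;
use DBM_Filter;
use File::Temp qw(tempdir);

# Hypothetical scratch database, removed when the script exits.
my $dir = tempdir(CLEANUP => 1);
my $dbf = "$dir/example.db";

my %h;
my $db = tie %h, 'DB_File', $dbf, O_CREAT|O_RDWR, 0666, $DB_BTREE
    or die "Cannot open $dbf: $!";

# Install the canned utf8 filter for keys and values alike.
$db->Filter_Push('utf8');

$h{'œ'} = 'cœur';                # no "Wide character" error now

my ($key) = keys %h;             # decoded back to characters
my $value = $h{'œ'};

undef $db;                       # drop the object before untie
untie %h;

binmode STDOUT, ':encoding(UTF-8)';
print "$key => $value\n";
```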

> 1. perldiag mentions "Wide character in %s" but not "Wide character in
> subroutine entry". The description for the former talks about
> filehandles and binmode, while "Wide character in subroutine entry"
> seems to demand a use of encode(...). Perhaps the "subroutine entry"
> version of the message should be described specially or separately in
> perldiag.

That warning is caused by the XS code for DB_File calling SvPVbyte(), and it
happens that the entersub ("subroutine entry") op used to call the XS code
is active at that point.

I'm not sure explaining that would be useful to a normal user reading the documentation.
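
The mechanism can be seen from pure Perl, though. The sketch below uses `utf8::downgrade` as a stand-in for the `SvPVbyte()` call, on the assumption that both go through the same downgrade-to-bytes step; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use utf8;

my $wide   = 'œ';   # U+0153: has no single-byte representation
my $narrow = 'é';   # U+00E9: fits in one byte

# With FAIL_OK set, utf8::downgrade returns false instead of dying.
my $narrow_copy = $narrow;
my $wide_copy   = $wide;
my $narrow_ok = utf8::downgrade($narrow_copy, 1) ? 1 : 0;
my $wide_ok   = utf8::downgrade($wide_copy,   1) ? 1 : 0;

# Without FAIL_OK the failure is fatal, and the message names the op
# that was running at the time: the call into the XS function.
my $error = '';
{
    my $copy = $wide;
    eval { utf8::downgrade($copy); 1 } or $error = $@;
}

# Encoding first yields octets, which always downgrade cleanly.
my $octets = $wide;
utf8::encode($octets);
my $octets_ok = utf8::downgrade($octets, 1) ? 1 : 0;

print "narrow=$narrow_ok wide=$wide_ok octets=$octets_ok\n";
print "error: $error";
```

Any XS function that asks for the byte form of a string containing characters above 0xFF fails the same way, which is why the fix is to encode before the call.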

> 2. I guess DB_File is a bit old, but I chose it because I don't need
> any of the BerkeleyDB features like cursors, and I value backwards
> compatibility. Perhaps the man page should mention that it doesn't
> work with UTF-8, which would have changed my decision. Or the man page
> could even mention that one needs to encode("utf-8", $_) on keys.
>
> 3. Then again, DB_File could be updated to support UTF-8.

DB_File is CPAN upstream and is maintained by the same author as BerkeleyDB.

CPAN upstream issues should be reported upstream, see https://rt.cpan.org/Public/Dist/Display.html?Name=DB_File

Tony

p5pRT commented Dec 9, 2015

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Dec 9, 2015

From @eserte

On Tue 08 Dec 2015, 14:33:46, frederik@ofb.net wrote:


DB_File (and the underlying Berkeley DB engine, I guess) can handle only binary data (octets, or equivalently Latin-1 characters). There's no way to specify an encoding, so "wide characters" cannot be stored directly. But if you know that you want to store data in the UTF-8 encoding, then you can define "DBM filters" which automatically translate wide characters into octets and vice versa:

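  # Encode characters to UTF-8 octets on the way into the database.
  for my $filter (qw(filter_store_key filter_store_value)) {
      (tied %h)->$filter(sub { $_ = encode('utf-8', $_) });
  }
  # Decode UTF-8 octets back to characters on the way out.
  for my $filter (qw(filter_fetch_key filter_fetch_value)) {
      (tied %h)->$filter(sub { $_ = decode('utf-8', $_) });
  }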

Maybe something like this could be added to the DB_File documentation.

Maybe there's also room for a tiny (CPAN) module, say DB_File::utf8, which does something like this automatically.
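
As a rough sketch of what such documentation might show, here are the filter loops above in a complete script. It assumes DB_File is installed; the scratch file name and sample data are invented. Reopening the file without filters shows that the database itself only ever holds UTF-8 octets.

```perl
use strict;
use warnings;
use utf8;
use Fcntl;                       # O_CREAT, O_RDWR
use DB_File;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

# Hypothetical scratch database, removed when the script exits.
my $dbf = tempdir(CLEANUP => 1) . "/filters.db";

my %h;
tie %h, 'DB_File', $dbf, O_CREAT|O_RDWR, 0666, $DB_BTREE
    or die "Cannot open $dbf: $!";

for my $filter (qw(filter_store_key filter_store_value)) {
    (tied %h)->$filter(sub { $_ = encode('UTF-8', $_) });
}
for my $filter (qw(filter_fetch_key filter_fetch_value)) {
    (tied %h)->$filter(sub { $_ = decode('UTF-8', $_) });
}

$h{'œ'} = 'œuvre';
my ($key) = keys %h;             # characters, thanks to the fetch filter
my $value = $h{$key};
untie %h;

# Reopen with no filters installed to look at what was really stored.
my %raw;
tie %raw, 'DB_File', $dbf, O_RDWR, 0666, $DB_BTREE
    or die "Cannot reopen $dbf: $!";
my ($raw_key) = keys %raw;
untie %raw;

printf "filtered key length: %d, raw key octets: %s\n",
    length($key), join(' ', map { sprintf '%02X', ord } split //, $raw_key);
```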

Regards,
  Slaven

p5pRT commented Dec 9, 2015

From @eserte

On Wed 09 Dec 2015, 12:52:43, slaven@rezic.de wrote:

> On Tue 08 Dec 2015, 14:33:46, frederik@ofb.net wrote:
> [...]
> Maybe something like this could be added to the DB_File documentation.
>
> Maybe there's also room for a tiny (CPAN) module, say DB_File::utf8,
> which does something like this automatically.

Missed Tony's answer, and of course, DBM_Filter::utf8 is there and good enough.

Regards,
  Slaven

p5pRT commented Dec 10, 2015

From @pmqs

From: slaven@rezic.de via RT [mailto:perlbug-followup@perl.org]

> On Wed 09 Dec 2015, 12:52:43, slaven@rezic.de wrote:
>
> > On Tue 08 Dec 2015, 14:33:46, frederik@ofb.net wrote:
> > [...]
> > Maybe something like this could be added to the DB_File documentation.
> >
> > Maybe there's also room for a tiny (CPAN) module, say DB_File::utf8,
> > which does something like this automatically.
>
> Missed Tony's answer, and of course, DBM_Filter::utf8 is there and good enough.

Hey Tony/Slaven - thanks for dealing with this for me. Only just seen it.

Just looking at the DB_File/BerkeleyDB docs I notice that I haven't actually mentioned the DBM_Filter module at all. I should also mention UTF-8 in the DB_File/BerkeleyDB docs, as it is a reasonably common use case.

I see there are new tickets against the modules themselves for this issue, so that means I won't forget to do the update.

cheers
Paul

p5pRT commented Dec 11, 2015

From frederik@ofb.net

Thank you Tony and Slaven for your replies.

I'm sending to bug-DB_File@rt.cpan.org and bug-BerkeleyDB@rt.cpan.org
as per instructions at rt.cpan.org.

The bug, to summarize the discussion, is really a request for the
documentation of the DB_File and BerkeleyDB packages to explain the
situation with respect to UTF-8 support: namely the lack of special
support, how to interpret the "wide character in subroutine entry"
message, and how to put filters on a database object to get UTF-8 to
work right.

I don't think any changes to the code are necessary, given what's been
said by Tony and Slaven.

Thanks again!


p5pRT commented Dec 16, 2015

From @tonycoz

On Thu Dec 10 15:27:33 2015, paul.marquess@ntlworld.com wrote:

> Hey Tony/Slaven - thanks for dealing with this for me. Only just seen
> it.
>
> Just looking at the DB_File/BerkeleyDB docs I notice that I haven't
> actually mentioned the DBM_Filter module at all. Should also mention
> UTF8 in the DB_File/BerkeleyDB docs as it is a reasonably common
> use-case.
>
> I see there are new tickets against the modules themselves for this
> issue, so that means I won't forget to do the update.

The only core issue (since DB_File and BerkeleyDB are CPAN upstream) that
I see is that DBM_Filter claims:

> In addition to the *DB*_File modules distributed with Perl, the
> BerkeleyDB module, available on CPAN, supports the DBM Filter hooks.

but when I tested the code from my sample calling Filter_Key_Push
against the BerkeleyDB $db object the method wasn't available.

Is BerkeleyDB meant to support DBM_Filter?

Tony

p5pRT commented Dec 16, 2015

From @pmqs

From: Tony Cook via RT [mailto:perlbug-followup@perl.org]

> On Thu Dec 10 15:27:33 2015, paul.marquess@ntlworld.com wrote:
>
> > Hey Tony/Slaven - thanks for dealing with this for me. Only just seen
> > it.
> >
> > Just looking at the DB_File/BerkeleyDB docs I notice that I haven't
> > actually mentioned the DBM_Filter module at all. Should also mention
> > UTF8 in the DB_File/BerkeleyDB docs as it is a reasonably common
> > use-case.
> >
> > I see there are new tickets against the modules themselves for this
> > issue, so that means I won't forget to do the update.
>
> The only core issue (since DB_File and BerkeleyDB are CPAN upstream) that
> I see is that DBM_Filter claims:
>
> > In addition to the *DB*_File modules distributed with Perl, the
> > BerkeleyDB module, available on CPAN, supports the DBM Filter hooks.
>
> but when I tested the code from my sample calling Filter_Key_Push against
> the BerkeleyDB $db object the method wasn't available.
>
> Is BerkeleyDB meant to support DBM_Filter?

Yes, it is. I'll take a look.

Paul

p5pRT commented Oct 16, 2017

From @tonycoz

On Tue, 08 Dec 2015 19:50:07 -0800, tonyc wrote:

> DB_File is CPAN upstream and is maintained by the same author as
> BerkeleyDB.
>
> CPAN upstream issues should be reported upstream, see
> https://rt.cpan.org/Public/Dist/Display.html?Name=DB_File

This was reported upstream as https://rt.cpan.org/Public/Bug/Display.html?id=110248 so closing.

Tony

p5pRT commented Oct 16, 2017

@tonycoz - Status changed from 'open' to 'resolved'
