What to do about 'use bytes' #12882

Open
p5pRT opened this issue Mar 26, 2013 · 28 comments

Comments

@p5pRT

p5pRT commented Mar 26, 2013

Migrated from rt.perl.org#117355 (status was 'open')

Searchable as RT117355$

@p5pRT

p5pRT commented Mar 26, 2013

From @Hugmeir

Created by @Hugmeir

$_ = "\x{30cb}";
use Devel::Peek;
use bytes;
Dump $_ for uc, lc, CORE::fc, ucfirst, lcfirst;

ucfirst & lcfirst return a UTF-8 flagged scalar, while the first three
return bytes.
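
The same comparison can be made without reading Devel::Peek dumps, by asking utf8::is_utf8 about each result. The following is only a minimal sketch: the helper name flag_report is hypothetical, and what it prints for ucfirst and lcfirst depends on the perl version in use, since the report above is against 5.16.2.

```perl
use strict;
use warnings;
use feature 'fc';   # fc needs perl 5.16 or later

# Hypothetical helper: apply each case-changing function under 'use bytes'
# and record whether the result carries the UTF-8 flag (1) or not (0).
sub flag_report {
    my ($str) = @_;
    use bytes;
    my %result = (
        uc      => uc($str),
        lc      => lc($str),
        fc      => fc($str),
        ucfirst => ucfirst($str),
        lcfirst => lcfirst($str),
    );
    my %flag;
    $flag{$_} = utf8::is_utf8($result{$_}) ? 1 : 0 for keys %result;
    return %flag;
}

my %flagged = flag_report("\x{30cb}");
printf "%-8s UTF-8 flag %s\n", $_, $flagged{$_} ? "on" : "off"
    for sort keys %flagged;
```

If all five functions respected 'use bytes' in the same way, every line of the output would read "off".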

Perl Info

Flags:
    category=core
    severity=high

Site configuration information for perl 5.16.2:

Configured by hugmeir at Tue Nov 20 17:20:00 ART 2012.

Summary of my perl5 (revision 5 version 16 subversion 2) configuration:

  Platform:
    osname=linux, osvers=3.5.0-18-generic, archname=x86_64-linux-thread-multi
    uname='linux naw 3.5.0-18-generic #29-ubuntu smp fri oct 19
10:26:51 utc 2012 x86_64 x86_64 x86_64 gnulinux '
    config_args='-de
-Dprefix=/home/hugmeir/perl5/perlbrew/perls/perl-5.16.2 -DDEBUGGING
-Dusethreads -Doptimize=-g -O0 -ggdb3 -Uversiononly -Accflags=-Wall
-Wextra -Aeval:scriptdir=/home/hugmeir/perl5/perlbrew/perls/perl-5.16.2/bin'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -Wall -Wextra
-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector
-I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-g -O0 -ggdb3',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -Wall -Wextra -DDEBUGGING
-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.7.2', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib/x86_64-linux-gnu /lib/../lib
/usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib /usr/lib
    libs=-lnsl -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.15'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -g -O0 -ggdb3
-L/usr/local/lib -fstack-protector'

Locally applied patches:



@INC for perl 5.16.2:
    /home/hugmeir/.perlbrew/libs/perl-5.16.2@all/lib/perl5/x86_64-linux-gnu-thread-multi
    /home/hugmeir/.perlbrew/libs/perl-5.16.2@all/lib/perl5/x86_64-linux-thread-multi
    /home/hugmeir/.perlbrew/libs/perl-5.16.2@all/lib/perl5
    /home/hugmeir/perl5/perlbrew/perls/perl-5.16.2/lib/site_perl/5.16.2/x86_64-linux-thread-multi
    /home/hugmeir/perl5/perlbrew/perls/perl-5.16.2/lib/site_perl/5.16.2
    /home/hugmeir/perl5/perlbrew/perls/perl-5.16.2/lib/5.16.2/x86_64-linux-thread-multi
    /home/hugmeir/perl5/perlbrew/perls/perl-5.16.2/lib/5.16.2
    .


Environment for perl 5.16.2:
    HOME=/home/hugmeir
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/hugmeir/.rbenv/shims:/home/hugmeir/.rbenv/bin:/home/hugmeir/.perlbrew/libs/perl-5.16.2@all/bin:/home/hugmeir/perl5/perlbrew/bin:/home/hugmeir/perl5/perlbrew/perls/perl-5.16.2/bin:/home/hugmeir/.rbenv/shims:/home/hugmeir/.rbenv/bin:/usr/lib/lightdm/lightdm:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
    PERL5LIB=/home/hugmeir/.perlbrew/libs/perl-5.16.2@all/lib/perl5/x86_64-linux-gnu-thread-multi:/home/hugmeir/.perlbrew/libs/perl-5.16.2@all/lib/perl5
    PERLBREW_BASHRC_VERSION=0.46
    PERLBREW_HOME=/home/hugmeir/.perlbrew
    PERLBREW_LIB=all
    PERLBREW_MANPATH=/home/hugmeir/.perlbrew/libs/perl-5.16.2@all/man:/home/hugmeir/perl5/perlbrew/perls/perl-5.16.2/man
    PERLBREW_PATH=/home/hugmeir/.perlbrew/libs/perl-5.16.2@all/bin:/home/hugmeir/perl5/perlbrew/bin:/home/hugmeir/perl5/perlbrew/perls/perl-5.16.2/bin
    PERLBREW_PERL=perl-5.16.2
    PERLBREW_ROOT=/home/hugmeir/perl5/perlbrew
    PERLBREW_VERSION=0.46
    PERL_BADLANG (unset)
    PERL_LOCAL_LIB_ROOT=/home/hugmeir/.perlbrew/libs/perl-5.16.2@all
    PERL_MB_OPT=--install_base /home/hugmeir/.perlbrew/libs/perl-5.16.2@all
    PERL_MM_OPT=INSTALL_BASE=/home/hugmeir/.perlbrew/libs/perl-5.16.2@all
    SHELL=/bin/bash

@p5pRT

p5pRT commented Mar 26, 2013

From @ap

* Brian Fraser <perlbug-followup@perl.org> [2013-03-26 01:50]:

$_ = "\x{30cb}";
use Devel::Peek;
use bytes;
Dump $_ for uc, lc, CORE::fc, ucfirst, lcfirst;

ucfirst & lcfirst return a UTF-8 flagged scalar, while the first three
return bytes.

Is it worth fixing something to follow a semantic that itself is broken
as designed?

I’m not sure if we had an explicit consensus about bytes.pm being highly
discouraged, the way we had about encoding.pm deserving deprecation, but
I would be happy if we could move it in that direction; and the farther,
the happier.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

@p5pRT

p5pRT commented Mar 26, 2013

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

p5pRT commented Mar 26, 2013

From @ikegami

On Mon, Mar 25, 2013 at 9:50 PM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:

* Brian Fraser <perlbug-followup@perl.org> [2013-03-26 01:50]:

$_ = "\x{30cb}";
use Devel::Peek;
use bytes;
Dump $_ for uc, lc, CORE::fc, ucfirst, lcfirst;

ucfirst & lcfirst return a UTF-8 flagged scalar, while the first three
return bytes.

Is it worth fixing something to follow a semantic that itself is broken
as designed?

I’m not sure if we had an explicit consensus about bytes.pm being highly
discouraged

"and use of this module for anything other than debugging purposes is
strongly discouraged."

, the way we had about encoding.pm deserving deprecation,

"This module is deprecated under perl 5.18. It uses a mechanism provided by
perl that is deprecated under 5.18 and higher, and may be removed in a
future version."

At least publicly, it's not quite the same level.

but I would be happy if we could move it in that direction; and the

farther, the happier.

Indeed. If you have to deal with a buggy module, you should be using
utf8::downgrade instead.

@p5pRT

p5pRT commented Mar 26, 2013

From @iabyn

On Tue, Mar 26, 2013 at 02:50:34AM +0100, Aristotle Pagaltzis wrote:

* Brian Fraser <perlbug-followup@perl.org> [2013-03-26 01:50]:

$_ = "\x{30cb}";
use Devel::Peek;
use bytes;
Dump $_ for uc, lc, CORE::fc, ucfirst, lcfirst;

ucfirst & lcfirst return a UTF-8 flagged scalar, while the first three
return bytes.

Is it worth fixing something to follow a semantic that itself is broken
as designed?

I’m not sure if we had an explicit consensus about bytes.pm being highly
discouraged, the way we had about encoding.pm deserving deprecation, but
I would be happy if we could move it in that direction; and the farther,
the happier.

From the top of the pod in bytes.pm, added for 5.12.0:

=head1 NOTICE

This pragma reflects early attempts to incorporate Unicode into perl and
has since been superseded. It breaks encapsulation (i.e. it exposes the
innards of how the perl executable currently happens to store a string),
and use of this module for anything other than debugging purposes is
strongly discouraged. If you feel that the functions here within might be
useful for your application, this possibly indicates a mismatch between
your mental model of Perl Unicode and the current reality. In that case,
you may wish to read some of the perl Unicode documentation:
L<perluniintro>, L<perlunitut>, L<perlunifaq> and L<perlunicode>.

--
Technology is dominated by two types of people: those who understand what
they do not manage, and those who manage what they do not understand.

@p5pRT

p5pRT commented May 5, 2013

From @khwilliamson

On 03/25/2013 11:45 PM, Eric Brine wrote:

On Mon, Mar 25, 2013 at 9:50 PM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:

* Brian Fraser <perlbug-followup@perl.org> [2013-03-26 01:50]:

$_ = "\x{30cb}";
use Devel::Peek;
use bytes;
Dump $_ for uc, lc, CORE::fc, ucfirst, lcfirst;

ucfirst & lcfirst return a UTF-8 flagged scalar, while the first three
return bytes.

Is it worth fixing something to follow a semantic that itself is broken
as designed?

I’m not sure if we had an explicit consensus about bytes.pm being highly
discouraged

"and use of this module for anything other than debugging purposes is
strongly discouraged."

, the way we had about encoding.pm deserving deprecation,

"This module is deprecated under perl 5.18. It uses a mechanism provided
by perl that is deprecated under 5.18 and higher, and may be removed in
a future version."

At least publicly, it's not quite the same level.

but I would be happy if we could move it in that direction; and the
farther, the happier.

Indeed. If you have to deal with a buggy module, you should be using
utf8::downgrade instead.

Attached is a patch that fixes the original report. The code it changes
is a small portion of this commit:

commit d54190f
  Author: Nicholas Clark <nick@ccl4.org>
  Date: Sat Apr 29 15:55:51 2006 +0000

  lcfirst/ucfirst plus an 8 bit locale could mangle UTF-8 values
  returned by overloaded stringification.

I was the one who added the comments much later. I was trying to make
sense of that code, and I think now that I didn't fully grok things.

I'm tempted to apply the patch unless someone can say why it would break
things, which would mean that the other functions are broken as well.
People do use 'use bytes'; we aren't going to remove it any time soon.

@p5pRT

p5pRT commented May 5, 2013

From @khwilliamson

0036-XXX-Patch-for-discussion-perl-117355-lu-cfirst-don-t.patch
From adaf7fd4f5bfc1575a0c9fc185002d9d6d5ca11a Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 4 May 2013 20:42:48 -0600
Subject: [PATCH 36/36] XXX Patch for discussion: [perl #117355] [lu]cfirst
 don't respect 'use bytes'

---
 pp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pp.c b/pp.c
index 865438b..d6e18e0 100644
--- a/pp.c
+++ b/pp.c
@@ -3655,7 +3655,7 @@ PP(pp_ucfirst)
 
 	/* In a "use bytes" we don't treat the source as UTF-8, but, still want
 	 * the destination to retain that flag */
-	if (SvUTF8(source))
+	if (SvUTF8(source) && ! IN_BYTES)
 	    SvUTF8_on(dest);
 
 	if (!inplace) {	/* Finish the rest of the string, unchanged */
-- 
1.8.1.3

@p5pRT

p5pRT commented Jul 5, 2013

From @tonycoz

On Sat May 04 20:03:46 2013, public@khwilliamson.com wrote:

Attached is a patch that fixes the original report. The code it changes
is a small portion of this commit:

commit d54190f
  Author: Nicholas Clark <nick@ccl4.org>
  Date: Sat Apr 29 15:55:51 2006 +0000

  lcfirst/ucfirst plus an 8 bit locale could mangle UTF-8 values
  returned by overloaded stringification.

I was the one who added the comments much later. I was trying to make
sense of that code, and I think now that I didn't fully grok things.

I'm tempted to apply the patch unless someone can say why it would break
things, which would mean that the other functions are broken as well.
People do use 'use bytes'; we aren't going to remove it any time soon.

As much as I despise use bytes, I think this patch could go in, but it
would need tests.

If no-one else provides tests I'll write some over the next few days.

Or not, if people object to the change, in which case they should
propose an alternative.

Tony

@p5pRT

p5pRT commented Jul 15, 2013

From @tonycoz

On Thu Jul 04 18:40:10 2013, tonyc wrote:

On Sat May 04 20:03:46 2013, public@khwilliamson.com wrote:

Attached is a patch that fixes the original report. The code it changes
is a small portion of this commit:

As much as I despise use bytes, I think this patch could go in, but it
would need tests.

If no-one else provides tests I'll write some over the next few days.

Or not, if people object to the change, in which case they should
propose an alternative.

Attached, some very basic tests, bytes.pm doesn't deserve much more :)

Tony

@p5pRT

p5pRT commented Jul 15, 2013

From @tonycoz

0001-perl-117355-very-basic-tests-for-ul-c-first-under-us.patch
From 40d56eaa790175e9c2b803c17bc5c38e03dc5ddc Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Mon, 15 Jul 2013 16:06:46 +1000
Subject: [PATCH 1/3] [perl #117355] very basic tests for [ul]c(first)? under
 use bytes

the [lu]cfirst tests are TODO due to #117355
---
 lib/bytes.t |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/lib/bytes.t b/lib/bytes.t
index c1ea9ea..6ac18df 100644
--- a/lib/bytes.t
+++ b/lib/bytes.t
@@ -5,7 +5,7 @@ BEGIN {
     require './test.pl';
 }
 
-plan tests => 20;
+plan tests => 24;
 
 my $a = chr(0x100);
 
@@ -28,6 +28,8 @@ is(bytes::chr(0x100), chr(0),  "bytes::chr sanity check");
 }
 
 my $c = chr(0x100);
+my $c2 = chr(0x2c7); # a unicode character that doesn't fold
+utf8::encode(my $c2_utf8 = $c2);
 
 {
     use bytes;
@@ -56,6 +58,13 @@ my $c = chr(0x100);
         is(bytes::rindex($c, "\xc4"), 0, "bytes::rindex under use bytes looks at bytes");
     }
     
+    # [perl #117355] [lu]cfirst don't respect 'use bytes'
+    # and if there's other tests for lc/uc under bytes I didn't find them
+    is(lc($c2), $c2_utf8, "lc under use bytes returns bytes");
+    is(uc($c2), $c2_utf8, "uc under use bytes returns bytes");
+    local $TODO = "[perl #117355] [lu]cfirst don't respect 'use bytes'";
+    is(lcfirst($c2), $c2_utf8, "lcfirst under use bytes returns bytes");
+    is(ucfirst($c2), $c2_utf8, "ucfirst under use bytes returns bytes");
 }
 
 {
-- 
1.7.10.4

@p5pRT

p5pRT commented Jul 15, 2013

From @cpansprout

On Tue Mar 26 03:50:37 2013, davem wrote:

On Tue, Mar 26, 2013 at 02:50:34AM +0100, Aristotle Pagaltzis wrote:

* Brian Fraser <perlbug-followup@perl.org> [2013-03-26 01:50]:

$_ = "\x{30cb}";
use Devel::Peek;
use bytes;
Dump $_ for uc, lc, CORE::fc, ucfirst, lcfirst;

ucfirst & lcfirst return a UTF-8 flagged scalar, while the first three
return bytes.

Is it worth fixing something to follow a semantic that itself is broken
as designed?

I’m not sure if we had an explicit consensus about bytes.pm being highly
discouraged, the way we had about encoding.pm deserving deprecation, but
I would be happy if we could move it in that direction; and the farther,
the happier.

From the top of the pod in bytes.pm, added for 5.12.0:

=head1 NOTICE

This pragma reflects early attempts to incorporate Unicode into perl and
has since been superseded. It breaks encapsulation (i.e. it exposes the
innards of how the perl executable currently happens to store a string),
and use of this module for anything other than debugging purposes is
strongly discouraged. If you feel that the functions here within might be
useful for your application, this possibly indicates a mismatch between
your mental model of Perl Unicode and the current reality. In that case,
you may wish to read some of the perl Unicode documentation:
L<perluniintro>, L<perlunitut>, L<perlunifaq> and L<perlunicode>.

What can we do to upgrade this to a deprecation?

--

Father Chrysostomos

@p5pRT

p5pRT commented Aug 12, 2013

From @rjbs

On Sun Jul 14 23:54:35 2013, sprout wrote:

From the top of the pod in bytes.pm, added for 5.12.0:

=head1 NOTICE

This pragma reflects early attempts to incorporate Unicode into perl and
has since been superseded. It breaks encapsulation (i.e. it exposes the
innards of how the perl executable currently happens to store a string),
and use of this module for anything other than debugging purposes is
strongly discouraged. If you feel that the functions here within might be
useful for your application, this possibly indicates a mismatch between
your mental model of Perl Unicode and the current reality. In that case,
you may wish to read some of the perl Unicode documentation:
L<perluniintro>, L<perlunitut>, L<perlunifaq> and L<perlunicode>.

What can we do to upgrade this to a deprecation?

I'm not sure.

The question is: do we propose to allow bytes.pm to become an external library? Can we do
this usefully, since bytes currently works (as I understand it) by tweaking $^H and letting CORE
sort out the rest? Can its behavior be reimplemented as something entirely without core
support? I think so, by making copies and downgrading. (Four-arg substr won't be exactly that
simple, but should be doable.)
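
As an illustration of the "making copies" idea, here is a rough sketch of how one function, a stand-in for bytes::length, might look with no core support. The name byte_length_by_copy is hypothetical, and this is one reading of the suggestion rather than a worked-out design: it copies the argument and, if the copy is stored as UTF-8, exposes those bytes with utf8::encode before measuring.

```perl
use strict;
use warnings;

# Hypothetical stand-in for bytes::length that needs no core support:
# copy the argument, and if the copy is stored as UTF-8 internally,
# turn that encoding into plain bytes before measuring it.
sub byte_length_by_copy {
    my ($copy) = @_;                              # work on a copy
    utf8::encode($copy) if utf8::is_utf8($copy);  # leave byte strings alone
    return length $copy;
}

print byte_length_by_copy("\x{100}"), "\n";   # one character, stored as two bytes
print byte_length_by_copy("abc"), "\n";       # three characters, three bytes
```

The copy is what makes this slower than the $^H approach, which is the trade-off described above.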

I haven't given this a lot of thought, but I think that if we can make bytes.pm ejectable, we
should do so. It's okay if it gets slower, since we've been telling people for years that it's only a
debugging tool, if that.

Thoughts? Objections?

--
rjbs

@p5pRT

p5pRT commented Aug 12, 2013

From @tonycoz

On Sun Jul 14 23:22:33 2013, tonyc wrote:

On Thu Jul 04 18:40:10 2013, tonyc wrote:

On Sat May 04 20:03:46 2013, public@khwilliamson.com wrote:

Attached is a patch that fixes the original report. The code it changes
is a small portion of this commit:

As much as I despise use bytes, I think this patch could go in, but it
would need tests.

If no-one else provides tests I'll write some over the next few days.

Or not, if people object to the change, in which case they should
propose an alternative.

Attached, some very basic tests, bytes.pm doesn't deserve much more :)

Applied as ac99361, 93e088e, ae5c28e.

I'll leave the ticket open for the deprecation discussion, though
perhaps that belongs in a different ticket.

Tony

@p5pRT

p5pRT commented Aug 12, 2013

From @cpansprout

On Sun Aug 11 19:56:53 2013, rjbs wrote:

On Sun Jul 14 23:54:35 2013, sprout wrote:

From the top of the pod in bytes.pm, added for 5.12.0:

=head1 NOTICE

This pragma reflects early attempts to incorporate Unicode into perl and
has since been superseded. It breaks encapsulation (i.e. it exposes the
innards of how the perl executable currently happens to store a string),
and use of this module for anything other than debugging purposes is
strongly discouraged. If you feel that the functions here within might be
useful for your application, this possibly indicates a mismatch between
your mental model of Perl Unicode and the current reality. In that case,
you may wish to read some of the perl Unicode documentation:
L<perluniintro>, L<perlunitut>, L<perlunifaq> and L<perlunicode>.

What can we do to upgrade this to a deprecation?

I'm not sure.

The question is: do we propose to allow bytes.pm to become an external library? Can we do
this usefully, since bytes currently works (as I understand it) by tweaking $^H and letting CORE
sort out the rest? Can its behavior be reimplemented as something entirely without core
support? I think so, by making copies and downgrading. (Four-arg substr won't be exactly that
simple, but should be doable.)

I haven't given this a lot of thought, but I think that if we can make bytes.pm ejectable, we
should do so. It's okay if it gets slower, since we've been telling people for years that it's only a
debugging tool, if that.

Thoughts? Objections?

How much of it would we reimplement?

If we want to keep its current behaviour, we would end up having to
override almost every op in what would become bytes.xs. Just search for
uses of DO_UTF8 throughout the core. DO_UTF8 means SvUTF8 unless bytes
is turned on, in which case we pretend the flag is not set. That means
"\xff"."\xff" returns "\xff\xc3\xbf" if the rhs is in utf8. So we have
to override concatenation via PL_check hooks, which gets messy. It
seems like a lot of work for preserving broken behaviour.
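
The concatenation case above can be tried with a short script. This is only a sketch: the variable names are made up for illustration, and no particular output is claimed for the 'use bytes' line, since it depends on how the running perl implements concatenation.

```perl
use strict;
use warnings;

my $lhs = "\xff";            # stored as a single byte
my $rhs = "\xff";
utf8::upgrade($rhs);         # same character, now stored as UTF-8 (two bytes)

# Normal semantics: the storage format is invisible.
my $normal = $lhs . $rhs;
printf "normal:    %d characters\n", length $normal;

# Under 'use bytes' the UTF-8 flag on $rhs is ignored, so the result
# may differ from the line above, as described in the comment.
my $bytewise = do { use bytes; $lhs . $rhs };
printf "use bytes: %d characters\n", length $bytewise;
```

A reimplementation outside core would have to reproduce whatever the second line prints, for every operator that consults DO_UTF8.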

Are you suggesting just a subset of the behaviour?

--

Father Chrysostomos

@p5pRT

p5pRT commented Aug 12, 2013

From @cpansprout

On Sun Aug 11 23:22:25 2013, sprout wrote:

How much of it would we reimplement?

If we want to keep its current behaviour, we would end up having to
override almost every op in what would become bytes.xs. Just search for
uses of DO_UTF8 throughout the core. DO_UTF8 means SvUTF8 unless bytes
is turned on, in which case we pretend the flag is not set. That means
"\xff"."\xff" returns "\xff\xc3\xbf" if the rhs is in utf8. So we have
to override concatenation via PL_check hooks, which gets messy. It
seems like a lot of work for preserving broken behaviour.

Just deprecating bytes.pm outright, not just deprecating it from core,
would be the easiest route, of course.

In 5.20 and 5.22 it warns ‘Use of bytes.pm is deprecated’.
In 5.24 it dies with ‘Can’t locate bytes.pm in @INC...’.

--

Father Chrysostomos

@p5pRT

p5pRT commented Aug 12, 2013

From victor@vsespb.ru

On Sun Aug 11 19:56:53 2013, rjbs wrote:

I haven't given this a lot of thought, but I think that if we can make bytes.pm ejectable, we
should do so. It's okay if it gets slower, since we've been telling people for years that it's only a
debugging tool, if that.

Thoughts? Objections?

It is possible to use bytes to speed up regexps by 20-40% in some cases:

use warnings;
use strict;
use utf8;
use bytes ();
use Encode;
use Benchmark qw/:all/;

sub test
{
  my ($s) = @_;
  $s = ($s x 400)."z";
  $s =~ /z/ for (1..100);
}

sub try_drop_utf8_flag
{
  Encode::_utf8_off($_[0]) if utf8::is_utf8($_[0]) &&
    (bytes::length($_[0]) == length($_[0]));
}

my $ascii = "x";
my ($ascii_u, undef) = split(/ /, "x тест");

die unless $ascii_u eq $ascii;

cmpthese(-1,{
  'ascii' => sub { test($ascii); },
  'ascii with utf8 on' => sub { test($ascii_u);},
  'ascii with utf8 bit cleared' => sub {
    my $s = $ascii_u;
    try_drop_utf8_flag($s);
    die if utf8::is_utf8($s);
    test($s);
  },
});

__END__

perl-5.18.0

                               Rate ascii with utf8 on ascii with utf8 bit cleared ascii
ascii with utf8 on          22635/s                 --                        -27%  -28%
ascii with utf8 bit cleared 30919/s                37%                          --   -2%
ascii                       31508/s                39%                          2%    --

perl-5.10.0

                               Rate ascii with utf8 on ascii with utf8 bit cleared ascii
ascii with utf8 on          25831/s                 --                        -19%  -21%
ascii with utf8 bit cleared 31717/s                23%                          --   -4%
ascii                       32881/s                27%                          4%    --

@p5pRT

p5pRT commented Aug 12, 2013

From victor@vsespb.ru

On Sun Aug 11 19:56:53 2013, rjbs wrote:

I haven't given this a lot of thought, but I think that if we can make bytes.pm ejectable, we
should do so. It's okay if it gets slower, since we've been telling people for years that it's only a
debugging tool, if that.

Thoughts? Objections?

Other possible uses of bytes are:

1) Run-time, production-enabled assertions (
http://en.wikipedia.org/wiki/Assertion_%28software_development%29 ).
This is similar to debugging, except that performance matters.

2) Unit tests (sometimes performance matters).

The example below contains a bug (from Perl's point of view it can be
treated as not a bug, but from the programmer's point of view it is one).

(The bug is marked with "# THIS LINE CONTAINS A BUG".)

It does not affect anything, not even the program's output, except
performance and memory usage.

bin_u is 7 bytes long, and bin_a is 4 bytes long.

If 7 vs 4 bytes looks unimportant, consider 700 vs 400 MiB of binary files.

This bug can be caught (at run time or in unit tests) if the line "#die if
is_wide_string($bin_u);" is uncommented.

The only possible way to catch this is to use bytes::length (or a
similar function which counts bytes), because the final output is the same
with or without the bug.

=====

use Encode;
use utf8;
use bytes ();
use strict;
use warnings;

sub is_wide_string
{
  defined($_[0]) && utf8::is_utf8($_[0]) &&
    (bytes::length($_[0]) != length($_[0]))
}

my $u = "\x{442}\x{435}\x{441}\x{442}"; # same as "тест"

# plain binary data, for example part of binary file (say, JPEG)
my $bin = "\xf1\xf2\xf3";

my $ascii = "x";
my ($ascii_u, undef) = split(/ /, "$ascii $u");
die unless utf8::is_utf8($ascii_u);

print "original bin length:\t";
print length($bin) . "\t" . bytes::length($bin) ."\n";

my $bin_a = $bin.$ascii;

print "bin_a length:\t";
print length($bin_a) . "\t" . bytes::length($bin_a) ."\n";

my $bin_u = $bin.$ascii_u; # THIS LINE CONTAINS A BUG

#die if is_wide_string($bin_u);

print "bin_u length:\t";
print length($bin_u) . "\t" . bytes::length($bin_u) ."\n";

open my $f, ">", "file_a.tmp";
binmode $f;
syswrite $f, $bin_a;
close $f;

open $f, ">", "file_u.tmp";
binmode $f;
syswrite $f, $bin_u;
close $f;

system("md5sum file_?.tmp");

__END__
original bin length: 3 3
bin_a length: 4 4
bin_u length: 4 7
33818f4b23aa74cddb8eb625845a459a file_a.tmp
33818f4b23aa74cddb8eb625845a459a file_u.tmp
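
For comparison, the same assertion can be sketched without bytes::length. The function name is_wide_string_nobytes is hypothetical; the sketch relies on the observation that a UTF-8-flagged string occupies more bytes than it has characters exactly when it contains a character above 0x7F.

```perl
use strict;
use warnings;

# Hypothetical variant of is_wide_string that avoids bytes::length.
# Returns 1 for a flagged string containing any character above 0x7F.
sub is_wide_string_nobytes {
    my ($s) = @_;
    return defined($s) && utf8::is_utf8($s) && $s =~ /[^\x00-\x7F]/ ? 1 : 0;
}

my $bin     = "\xf1\xf2\xf3";
my $ascii_u = "x";
utf8::upgrade($ascii_u);                              # ASCII text with the flag set

print is_wide_string_nobytes($bin), "\n";             # flag off
print is_wide_string_nobytes($ascii_u), "\n";         # flag on, ASCII only
print is_wide_string_nobytes($bin . $ascii_u), "\n";  # upgraded by the concatenation
```

Only the third call reports a wide string, which is the case the commented-out die in the example is meant to catch. This version still inspects the internal flag through utf8::is_utf8, so it does not remove the dependence on string internals.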

@p5pRT

p5pRT commented Aug 12, 2013

From victor@vsespb.ru

On Sun Aug 11 19:56:53 2013, rjbs wrote:

Another edge case where Perl's Unicode handling does not work well is filenames.

The code below prints that two strings are the same. It then opens a file
with the name given by one string and tries to reopen it with the name given
by the second string. The second attempt fails.

So that is another case where a user might want to use bytes::xxx,
_is_utf8() etc. to access Perl string internals, because the internals
matter in this case.

(By the way, I have a program which has to deal with non-UTF filesystems, which
makes things even worse. It has to pass _binary_ strings representing
filenames across the whole program, and I have to be very careful never to
merge them with ASCII strings that have the utf-8 bit set, or with Unicode strings.)

========

use Encode;
use utf8;
use bytes (); # bytes::length is used below
use strict;
use warnings;

my $u = "\x{442}\x{435}\x{441}\x{442}"; # same as "тест"

# plain binary data, for example part of binary file (say, JPEG)
my $bin = "\xf1\xf2\xf3";

my $ascii = "x";
my ($ascii_u, undef) = split(/ /, "$ascii $u");
die unless utf8::is_utf8($ascii_u);

print "original bin length:\t";
print length($bin) . "\t" . bytes::length($bin) ."\n";

my $bin_a = $bin.$ascii;

print "bin_a length:\t";
print length($bin_a) . "\t" . bytes::length($bin_a) ."\n";

my $bin_u = $bin.$ascii_u; # THIS LINE CONTAINS A BUG

die unless $bin_u eq $bin_a;
print "bin_u and bin_a are same!\n";

open my $f, ">", "$bin_u.tmp";
binmode $f;
syswrite $f, "TEST";
close $f;

open $f, "<", "$bin_a.tmp" or die "file not found $!";
__END__
original bin length: 3 3
bin_a length: 4 4
bin_u and bin_a are same!
file not found No such file or directory at poc5.pl line 39.
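
One way to guard against this without bytes::xxx is to normalise the storage of a filename just before it reaches open. The sketch below assumes filenames are byte strings, as in the program described above; the helper name fs_bytes is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of a byte-oriented filename whose
# internal storage is one byte per character, so the bytes handed to the
# operating system do not depend on the string's history.
sub fs_bytes {
    my ($name) = @_;
    utf8::downgrade($name);   # dies if $name contains characters above 0xFF
    return $name;
}

my $ascii_u = "x";
utf8::upgrade($ascii_u);                    # ASCII text with the flag set
my $bin_a = "\xf1\xf2\xf3" . "x";
my $bin_u = "\xf1\xf2\xf3" . $ascii_u;      # upgraded by the concatenation

for my $name ($bin_a, $bin_u) {
    my $fs = fs_bytes("$name.tmp");
    printf "flag before=%d after=%d\n",
        utf8::is_utf8($name) ? 1 : 0, utf8::is_utf8($fs) ? 1 : 0;
}
```

With such a helper, open my $f, ">", fs_bytes("$bin_u.tmp") and open $f, "<", fs_bytes("$bin_a.tmp") would be given identically stored names.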

@p5pRT

p5pRT commented Aug 12, 2013

From @ikegami

On Mon, Aug 12, 2013 at 4:08 PM, Victor Efimov via RT <perlbug-followup@perl.org> wrote:

sub try_drop_utf8_flag
{
  Encode::_utf8_off($_[0]) if utf8::is_utf8($_[0]) &&
    (bytes::length($_[0]) == length($_[0]));
}

That's just C<< utf8::downgrade($_[0], 1) >>
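
A small sketch of that call, with made-up variable names. The second argument (FAIL_OK) makes utf8::downgrade return false instead of dying when the string cannot be stored one byte per character.

```perl
use strict;
use warnings;

my $ascii_u = "x";
utf8::upgrade($ascii_u);              # ASCII text carrying the UTF-8 flag
my $wide = "\x{442}";                 # cannot be stored as single bytes

# With FAIL_OK set, a failed downgrade returns false and leaves the string alone.
my $ok_ascii = utf8::downgrade($ascii_u, 1);
my $ok_wide  = utf8::downgrade($wide, 1);

printf "ascii: downgraded=%d flag=%d\n",
    $ok_ascii ? 1 : 0, utf8::is_utf8($ascii_u) ? 1 : 0;
printf "wide:  downgraded=%d flag=%d\n",
    $ok_wide ? 1 : 0, utf8::is_utf8($wide) ? 1 : 0;
```

One difference from try_drop_utf8_flag is worth noting: utf8::downgrade also converts flagged strings whose characters are all in the range 0x80 to 0xFF, which the quoted function would leave flagged. For ASCII-only strings the two behave the same.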

@p5pRT

p5pRT commented Aug 12, 2013

From @Hugmeir

On Mon, Aug 12, 2013 at 5:08 PM, Victor Efimov via RT <perlbug-followup@perl.org> wrote:

It is possible to use bytes to speed up regexps by 20-40% in some cases:

[benchmark script and results snipped]

I've tweaked your benchmark a bit:

use warnings;
use strict;
use utf8;
use charnames ':full'; # needed for \N{...} before perl 5.16
use Benchmark qw/:all/;

my $ascii = ("x" x 400) . "z";
my $ascii_u = ("\N{LATIN SMALL LETTER X}" x 400) . "z";
my $bytes = $ascii_u;

utf8::encode($bytes);

die unless $ascii_u eq $ascii;

cmpthese(timethese(10000,{
  'ascii' => sub { $ascii =~ /z/ for 1..100 },
  'ascii, utf8' => sub { $ascii_u =~ /z/ for 1..100 },
  'ascii, utf8 cleared' => sub { $bytes =~ /z/ for 1..100 },
  'ascii, /aa' => sub { $ascii =~ /z/aa for 1..100 },
  'ascii, utf8, /aa' => sub { $ascii_u =~ /z/aa for 1..100 },
  'ascii, utf8 cleared, /aa' => sub { $bytes =~ /z/aa for 1..100 },
  'ascii, bytes' => sub { use bytes; $ascii =~ /z/ for 1..100 },
  'ascii, utf8, bytes' => sub { use bytes; $ascii_u =~ /z/ for 1..100 },
  'ascii, utf8 cleared, bytes' => sub { use bytes; $bytes =~ /z/ for 1..100 },
}));

Which gives me, on 5.14.2:

Benchmark: timing 10000 iterations of ascii, ascii, /aa, ascii, bytes, ascii, utf8, ascii, utf8 cleared, ascii, utf8 cleared, /aa, ascii, utf8 cleared, bytes, ascii, utf8, /aa, ascii, utf8, bytes...
ascii: 2 wallclock secs ( 1.55 usr + 0.00 sys = 1.55 CPU) @ 6451.61/s (n=10000)
ascii, /aa: 1 wallclock secs ( 1.54 usr + 0.00 sys = 1.54 CPU) @ 6493.51/s (n=10000)
ascii, bytes: 2 wallclock secs ( 1.71 usr + 0.00 sys = 1.71 CPU) @ 5847.95/s (n=10000)
ascii, utf8: 3 wallclock secs ( 3.04 usr + 0.00 sys = 3.04 CPU) @ 3289.47/s (n=10000)
ascii, utf8 cleared: 2 wallclock secs ( 1.87 usr + 0.00 sys = 1.87 CPU) @ 5347.59/s (n=10000)
ascii, utf8 cleared, /aa: 2 wallclock secs ( 1.84 usr + 0.00 sys = 1.84 CPU) @ 5434.78/s (n=10000)
ascii, utf8 cleared, bytes: 2 wallclock secs ( 1.92 usr + 0.00 sys = 1.92 CPU) @ 5208.33/s (n=10000)
ascii, utf8, /aa: 3 wallclock secs ( 2.87 usr + 0.00 sys = 2.87 CPU) @ 3484.32/s (n=10000)
ascii, utf8, bytes: 2 wallclock secs ( 1.82 usr + 0.01 sys = 1.83 CPU) @ 5464.48/s (n=10000)

Columns: (1) ascii, utf8  (2) ascii, utf8, /aa  (3) ascii, utf8 cleared, bytes
         (4) ascii, utf8 cleared  (5) ascii, utf8 cleared, /aa  (6) ascii, utf8, bytes
         (7) ascii, bytes  (8) ascii  (9) ascii, /aa

                             Rate   (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)   (9)
ascii, utf8                3289/s    --   -6%  -37%  -38%  -39%  -40%  -44%  -49%  -49%
ascii, utf8, /aa           3484/s    6%    --  -33%  -35%  -36%  -36%  -40%  -46%  -46%
ascii, utf8 cleared, bytes 5208/s   58%   49%    --   -3%   -4%   -5%  -11%  -19%  -20%
ascii, utf8 cleared        5348/s   63%   53%    3%    --   -2%   -2%   -9%  -17%  -18%
ascii, utf8 cleared, /aa   5435/s   65%   56%    4%    2%    --   -1%   -7%  -16%  -16%
ascii, utf8, bytes         5464/s   66%   57%    5%    2%    1%    --   -7%  -15%  -16%
ascii, bytes               5848/s   78%   68%   12%    9%    8%    7%    --   -9%  -10%
ascii                      6452/s   96%   85%   24%   21%   19%   18%   10%    --   -1%
ascii, /aa                 6494/s   97%   86%   25%   21%   19%   19%   11%    1%    --

On 5.16.2:
Benchmark: timing 10000 iterations of ascii, ascii, /aa, ascii, bytes, ascii, utf8, ascii, utf8 cleared, ascii, utf8 cleared, /aa, ascii, utf8 cleared, bytes, ascii, utf8, /aa, ascii, utf8, bytes...
ascii: 2 wallclock secs ( 1.51 usr + 0.00 sys = 1.51 CPU) @ 6622.52/s (n=10000)
ascii, /aa: 1 wallclock secs ( 1.54 usr + 0.00 sys = 1.54 CPU) @ 6493.51/s (n=10000)
ascii, bytes: 2 wallclock secs ( 1.53 usr + 0.00 sys = 1.53 CPU) @ 6535.95/s (n=10000)
ascii, utf8: 2 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 4739.34/s (n=10000)
ascii, utf8 cleared: 1 wallclock secs ( 1.52 usr + 0.00 sys = 1.52 CPU) @ 6578.95/s (n=10000)
ascii, utf8 cleared, /aa: 2 wallclock secs ( 1.52 usr + 0.01 sys = 1.53 CPU) @ 6535.95/s (n=10000)
ascii, utf8 cleared, bytes: 2 wallclock secs ( 1.51 usr + 0.01 sys = 1.52 CPU) @ 6578.95/s (n=10000)
ascii, utf8, /aa: 2 wallclock secs ( 2.11 usr + 0.00 sys = 2.11 CPU) @ 4739.34/s (n=10000)
ascii, utf8, bytes: 1 wallclock secs ( 1.56 usr + 0.00 sys = 1.56 CPU) @ 6410.26/s (n=10000)
                             Rate ascii, utf8 ascii, utf8, /aa ascii, utf8, bytes ascii, /aa ascii, bytes ascii, utf8 cleared, /aa ascii, utf8 cleared, bytes ascii, utf8 cleared ascii
ascii, utf8                4739/s          --              -0%               -26%       -27%         -27%                     -27%                       -28%                -28%  -28%
ascii, utf8, /aa           4739/s          0%               --               -26%       -27%         -27%                     -27%                       -28%                -28%  -28%
ascii, utf8, bytes         6410/s         35%              35%                 --        -1%          -2%                      -2%                        -3%                 -3%   -3%
ascii, /aa                 6494/s         37%              37%                 1%         --          -1%                      -1%                        -1%                 -1%   -2%
ascii, bytes               6536/s         38%              38%                 2%         1%           --                      -0%                        -1%                 -1%   -1%
ascii, utf8 cleared, /aa   6536/s         38%              38%                 2%         1%           0%                       --                        -1%                 -1%   -1%
ascii, utf8 cleared, bytes 6579/s         39%              39%                 3%         1%           1%                       1%                         --                  0%   -1%
ascii, utf8 cleared        6579/s         39%              39%                 3%         1%           1%                       1%                         0%                  --   -1%
ascii                      6623/s         40%              40%                 3%         2%           1%                       1%                         1%                  1%    --

(skipping 5.18 because I compiled it with different options than everything
else, whoops)
On blead:
Benchmark: timing 10000 iterations of ascii, ascii, /aa, ascii, bytes, ascii, utf8, ascii, utf8 cleared, ascii, utf8 cleared, /aa, ascii, utf8 cleared, bytes, ascii, utf8, /aa, ascii, utf8, bytes...
ascii: 2 wallclock secs ( 1.60 usr + 0.00 sys = 1.60 CPU) @ 6250.00/s (n=10000)
ascii, /aa: 1 wallclock secs ( 1.58 usr + 0.00 sys = 1.58 CPU) @ 6329.11/s (n=10000)
ascii, bytes: 2 wallclock secs ( 1.59 usr + 0.00 sys = 1.59 CPU) @ 6289.31/s (n=10000)
ascii, utf8: 2 wallclock secs ( 1.63 usr + 0.00 sys = 1.63 CPU) @ 6134.97/s (n=10000)
ascii, utf8 cleared: 1 wallclock secs ( 1.61 usr + 0.00 sys = 1.61 CPU) @ 6211.18/s (n=10000)
ascii, utf8 cleared, /aa: 2 wallclock secs ( 1.64 usr + 0.00 sys = 1.64 CPU) @ 6097.56/s (n=10000)
ascii, utf8 cleared, bytes: 2 wallclock secs ( 1.63 usr + 0.00 sys = 1.63 CPU) @ 6134.97/s (n=10000)
ascii, utf8, /aa: 1 wallclock secs ( 1.63 usr + 0.00 sys = 1.63 CPU) @ 6134.97/s (n=10000)
ascii, utf8, bytes: 2 wallclock secs ( 1.63 usr + 0.00 sys = 1.63 CPU) @ 6134.97/s (n=10000)
                             Rate ascii, utf8 cleared, /aa ascii, utf8, bytes ascii, utf8, /aa ascii, utf8 ascii, utf8 cleared, bytes ascii, utf8 cleared ascii ascii, bytes ascii, /aa
ascii, utf8 cleared, /aa   6098/s                       --                -1%              -1%         -1%                        -1%                 -2%   -2%          -3%        -4%
ascii, utf8, bytes         6135/s                       1%                 --              -0%         -0%                        -0%                 -1%   -2%          -2%        -3%
ascii, utf8, /aa           6135/s                       1%                -0%               --         -0%                        -0%                 -1%   -2%          -2%        -3%
ascii, utf8                6135/s                       1%                 0%               0%          --                        -0%                 -1%   -2%          -2%        -3%
ascii, utf8 cleared, bytes 6135/s                       1%                 0%               0%          0%                         --                 -1%   -2%          -2%        -3%
ascii, utf8 cleared        6211/s                       2%                 1%               1%          1%                         1%                  --   -1%          -1%        -2%
ascii                      6250/s                       2%                 2%               2%          2%                         2%                  1%    --          -1%        -1%
ascii, bytes               6289/s                       3%                 3%               3%          3%                         3%                  1%    1%           --        -1%
ascii, /aa                 6329/s                       4%                 3%               3%          3%                         3%                  2%    1%           1%         --

So, your argument about the speedup might hold some water on older perls,
but the way to get it is not necessarily "use bytes", but "if you don't
want unicode semantics, encode your strings before matching" -- and as of
blead, looks like ascii+utf8 now matches just as fast as plain ascii.

p5pRT commented Aug 12, 2013

From victor@vsespb.ru

On Mon Aug 12 13:57:25 2013, ikegami@adaelis.com wrote:

On Mon, Aug 12, 2013 at 4:08 PM, Victor Efimov via RT <perlbug-followup@perl.org> wrote:

sub try_drop_utf8_flag
{
  Encode::_utf8_off($_[0]) if utf8::is_utf8($_[0]) &&
    (bytes::length($_[0]) == length($_[0]));
}

That's just C<< utf8::downgrade($_[0], 1) >>

Yes, you are right, except for one small difference.
For characters > 127 but <= 255 it works differently.
Thus it cannot be used when the strings are filenames (as in the example
above; another example is below).

(By the way, that is exactly how I use it in my program
https://github.com/vsespb/mt-aws-glacier - read millions of filenames,
split them, try to drop the UTF-8 flags, and process them with regexps.)

use bytes ();
use utf8;
binmode STDOUT, ":encoding(utf-8)";
use Devel::Peek;
sub try_drop_utf8_flag
{
  Encode::_utf8_off($_[0]) if utf8::is_utf8($_[0]) &&
    (bytes::length($_[0]) == length($_[0]));
}
sub do_downgrade
{
  utf8::downgrade($_[0], 1)
}
my $s = "ú";
my $s1 = $s;
try_drop_utf8_flag($s1);
my $s2 = $s;
do_downgrade($s2);
Dump($s1);
Dump($s2);

die unless $s1 eq $s2;

open my $f, ">", "$s1.tmp";
binmode $f;
syswrite $f, "TEST";
close $f;

open $f, "<", "$s2.tmp" or die "file not found $!";

__END__
SV = PVMG(0xfc00a0) at 0xfc1440
  REFCNT = 1
  FLAGS = (PADMY,SMG,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x1042b90 "\303\272"\0 [UTF8 "\x{fa}"]
  CUR = 2
  LEN = 8
  MAGIC = 0x1094090
  MG_VIRTUAL = &PL_vtbl_utf8
  MG_TYPE = PERL_MAGIC_utf8(w)
  MG_LEN = 1
SV = PV(0xfd6538) at 0xfc1488
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0xfdccd0 "\372"\0
  CUR = 1
  LEN = 8
file not found No such file or directory at bench3-poc.pl line 29.

p5pRT commented Aug 12, 2013

From victor@vsespb.ru

Sorry, RT corrupted this character in the code (even RT doesn't like Latin-1
characters, unlike "wide" characters such as Cyrillic!). I meant this one:
http://www.fileformat.info/info/unicode/char/da/index.htm

p5pRT commented Aug 12, 2013

From victor@vsespb.ru

On Mon Aug 12 14:02:02 2013, Hugmeir wrote:

but the way to get it is not necessarily "use bytes", but "if you don't
want unicode semantics, encode your strings before matching"

Well, utf8::encode($bytes) will change the string. So if

a) I have an ASCII-only regexp,
b) I have data that is 7-bit ASCII in most cases but sometimes Unicode
with wide characters,
c) I want the regexp to run fast, at least when the data is ASCII, and
d) I want the code not to break when the data is not ASCII,

then utf8::encode($bytes) won't work as needed. It will damage the string if
it's Unicode: it won't be a character string anymore (and I might want to
process it after the regexp match, or use the regexp match variables).
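
To make that objection concrete, here is a minimal sketch of what utf8::encode does to a string with wide characters. The sample string and the pattern are assumptions chosen for illustration, not data from the benchmark above. The original keeps a length of 4 and matches the character-level pattern, while the encoded copy has a length of 8 and no longer matches.

```perl
use strict;
use warnings;

# Illustrative sample data (an assumption): four Cyrillic characters,
# written with escapes so the source file itself stays ASCII.
my $chars = "\x{442}\x{435}\x{441}\x{442}";

# utf8::encode works in place, so operate on a copy.
my $octets = $chars;
utf8::encode($octets);    # now a sequence of UTF-8 octets, UTF8 flag off

# A character-level pattern matches the original but not the encoded copy.
my $re = qr/\x{435}\x{441}/;

printf "chars:  length %d, matches: %s\n",
    length($chars),  ($chars  =~ $re ? "yes" : "no");
printf "octets: length %d, matches: %s\n",
    length($octets), ($octets =~ $re ? "yes" : "no");
```

An ASCII-only pattern such as /z/ would behave the same on both forms; the difference only shows up once the pattern or the later processing works on characters above 127.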

and as of
blead, looks like ascii+utf8 now matches just as fast as plain ascii.

Yes, indeed: 5.18 is still slow, but blead is already fast.

p5pRT commented Aug 13, 2013

From @ikegami

On Mon, Aug 12, 2013 at 5:17 PM, Victor Efimov via RT <perlbug-followup@perl.org> wrote:

On Mon Aug 12 13:57:25 2013, ikegami@adaelis.com wrote:

On Mon, Aug 12, 2013 at 4:08 PM, Victor Efimov via RT <perlbug-followup@perl.org> wrote:

sub try_drop_utf8_flag
{
  Encode::_utf8_off($_[0]) if utf8::is_utf8($_[0]) &&
    (bytes::length($_[0]) == length($_[0]));
}

That's just C<< utf8::downgrade($_[0], 1) >>

Yes, you are right, except for one small difference.
For characters > 127 but <= 255 it works differently.
Thus it cannot be used when the strings are filenames (as in the example
above; another example is below).

I see, but it's still not a reason to keep bytes. It simply means we need
to add a function that downgrades only ASCII-only strings.
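
Such a function does not exist in core as far as this thread shows, but it can be sketched in pure Perl without bytes.pm. The name downgrade_if_ascii is hypothetical. The idea is to check at the character level that nothing is above 127 and only then call utf8::downgrade, which for an ASCII-only string merely clears the flag and leaves the stored bytes alone.

```perl
use strict;
use warnings;

# Hypothetical helper (not a core function): drop the UTF8 flag only when
# the string is pure ASCII, so the stored bytes never change.
sub downgrade_if_ascii {
    return 0 if $_[0] =~ /[^\x00-\x7F]/;   # leave non-ASCII strings untouched
    return utf8::downgrade($_[0], 1);      # in place, via the @_ alias
}

my $ascii_u = "x";
utf8::upgrade($ascii_u);                   # ASCII content, UTF8 flag on

my $latin_u = "\xFA";
utf8::upgrade($latin_u);                   # U+00FA stored as two bytes

downgrade_if_ascii($ascii_u);
downgrade_if_ascii($latin_u);

print "ascii_u: flag ", (utf8::is_utf8($ascii_u) ? "on" : "off"), "\n";
print "latin_u: flag ", (utf8::is_utf8($latin_u) ? "on" : "off"), "\n";
```

With this, the ASCII-only string ends up with the flag off and the Latin-1 string keeps its upgraded form, which is the behaviour try_drop_utf8_flag was after.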

p5pRT commented Aug 15, 2013

From @ap

* Victor Efimov via RT <perlbug-followup@perl.org> [2013-08-12 23:20]:

file not found No such file or directory at bench3-poc.pl line 29.

That is really the last remnant (I think) of The Unicode Bug. The
problem here is that `open` and all the other file-related functions
blithely ignore the UTF8 flag, which is utterly broken.
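
A sketch of one way to sidestep the flag dependence while it persists: encode the name to octets explicitly before handing it to `open`, so that both internal representations of the same string produce the same path. This is not something proposed in this thread; the helper name fs_name is hypothetical, and treating file names as UTF-8 is an assumption about the target system.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

# Hypothetical helper: turn a character-string name into the octets that
# are handed to the OS. UTF-8 file names are an assumption here.
sub fs_name {
    my ($name) = @_;
    return encode('UTF-8', $name);
}

my $dir = tempdir(CLEANUP => 1);

my $down = "\xFA";         # U+00FA stored as one byte, UTF8 flag off
my $up   = $down;
utf8::upgrade($up);        # same character stored as two bytes, flag on

# Both variables are equal as strings and now map to the same path.
open my $out, '>', "$dir/" . fs_name($down) . ".tmp" or die "write: $!";
print {$out} "TEST";
close $out or die "close: $!";

open my $in, '<', "$dir/" . fs_name($up) . ".tmp" or die "read: $!";
my $content = <$in>;
close $in;

print "read back: $content\n";
```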

Your use of bytes.pm is a workaround. And however useful it may be while
the bug persists, a workaround is all it is. It *isn’t* a legitimately
good use case for bytes.pm.

* Victor Efimov via RT <perlbug-followup@perl.org> [2013-08-12 22:40]:

Other possible uses of bytes are:

1) run-time, production-enabled assertions (
http://en.wikipedia.org/wiki/Assertion_%28software_development%29 ).
It's similar to debugging, except performance matters.

2) Unit tests (sometimes performance matters).

I have no idea what the concept of assertions or that of unit tests has
to do with the internal representation of strings, or how what you wrote
afterwards is related to those things.

Below example contains a bug (from Perl point view this can be treated
as not-a-bug, but from programmer point of view it's a bug).

(bug marked with "# THIS LINE CONTAINS A BUG")

OK, to cut a long story short, the line is

  my $bin_u = $bin.$ascii_u; # THIS LINE CONTAINS A BUG

in which $ascii_u has the UTF8 flag set even though it contains only
characters < 128, and $bin contains characters between 128 and 255 yet
*doesn’t* have the UTF8 flag set.

Your complaint is that the concatenation blindly produces a string with
the UTF8 flag set, which requires the contents of $bin to be upgraded to
produce $bin_u, which takes up extra space, despite the string containing
only characters < 256.

This is not a bug, though it certainly is suboptimal.

In theory Perl string operations could try to produce downgraded strings
whenever possible, but that requires scanning the string in many cases where
it currently doesn’t happen. Things would almost certainly actually get
slower.
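
The effect described here can be reproduced with a few lines. This is a minimal sketch using the variable names from the quoted example; the $ascii_u string is built with utf8::upgrade rather than with split, which is an assumption made to keep the example short. bytes::length is used purely as a debugging aid to show the internal size.

```perl
use strict;
use warnings;
use bytes ();    # loaded only so that bytes::length is available

my $bin = "\xf1\xf2\xf3";      # three characters in 128..255, UTF8 flag off
my $ascii_u = "x";
utf8::upgrade($ascii_u);       # ASCII-only content, but UTF8 flag on

my $bin_u = $bin . $ascii_u;   # concatenation yields an upgraded string

my $flag_before  = utf8::is_utf8($bin_u) ? "on" : "off";
my $bytes_before = bytes::length($bin_u);

utf8::downgrade($bin_u, 1);    # possible because every character is < 256

my $flag_after  = utf8::is_utf8($bin_u) ? "on" : "off";
my $bytes_after = bytes::length($bin_u);

printf "before: flag %s, %d characters, %d bytes internally\n",
    $flag_before, length($bin_u), $bytes_before;
printf "after:  flag %s, %d characters, %d bytes internally\n",
    $flag_after, length($bin_u), $bytes_after;
```

The string has 4 characters throughout; only the internal storage changes, from 7 bytes while upgraded to 4 bytes once downgraded.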

It does not affect anything, even program output, except
performance/memory usage.

bin_u is 7 bytes length, and bin_a is 4 bytes length.

if 7 vs 4 bytes looks unimportant, consider 700 vs 400 MiB of binary files.

And this bug can be caught (runtime or in unit tests) if line "#die if
is_wide_string($bin_u);" uncommented.

The only possible way to catch this is a use of bytes​::length (or
similar function which count bytes), because final output is same with
or without bug.

No.

The only possible way to catch this is encoding::warnings (with FATAL),
because if you try to catch it manually, you will miss places where you
would need to put checks. https://metacpan.org/module/encoding::warnings

Also, if you *already know* (some of) the places in your program where
this can happen, then the workaround is not to try to “catch” it after
it happened, but – again! – to utf8::downgrade your strings before you
concatenate them.

  utf8::downgrade($bin, 1);
  utf8::downgrade($ascii_u, 1);
  my $bin_u = $bin.$ascii_u; # THIS LINE NO LONGER CONTAINS A BUG

So here too, I do not see bytes.pm being useful in any real way. In fact
in this case it doesn’t even provide a useful piece of a workaround.
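
For a known spot, a check along these lines also needs nothing from bytes.pm. This is only a sketch: the helper name assert_not_needlessly_upgraded is hypothetical, and it relies solely on utf8::is_utf8 plus a character-class test to detect a string that is stored upgraded although all of its characters are below 256.

```perl
use strict;
use warnings;

# Hypothetical assertion helper: dies if the string is stored upgraded
# even though all of its characters would fit in single bytes.
sub assert_not_needlessly_upgraded {
    my ($label, $str) = @_;
    return 1 unless utf8::is_utf8($str);    # already stored as bytes
    return 1 if $str =~ /[^\x00-\xFF]/;     # genuinely needs the UTF8 form
    die "$label is upgraded but contains only characters < 256\n";
}

my $bin = "\xf1\xf2\xf3";
my $ascii_u = "x";
utf8::upgrade($ascii_u);

# Unfixed concatenation: the assertion should fire.
my $bad = $bin . $ascii_u;
my $caught = !eval { assert_not_needlessly_upgraded('bad', $bad); 1 };

# Fixed version: downgrade first, then concatenate.
utf8::downgrade($ascii_u, 1);
my $good = $bin . $ascii_u;
assert_not_needlessly_upgraded('good', $good);

print "unfixed concatenation flagged: ", ($caught ? "yes" : "no"), "\n";
```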

* Victor Efimov via RT <perlbug-followup@perl.org> [2013-08-12 22:55]:

Code below prints that two strings are same. Tries to open file with
name defined by one string, and then to reopen file with name defined
by second string. Second attempt fail.

This is exactly the same bug as in your first comment on this issue, i.e.
just The Unicode Bug in `open` and friends.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT commented Aug 16, 2013

From victor@vsespb.ru

On Thu Aug 15 15:46:39 2013, aristotle wrote:

That is really the last remnant (I think) of The Unicode Bug.

More precisely, http://perldoc.perl.org/perlunicode.html, "When Unicode
Does Not Happen".

And however useful it may be while the bug persists, a workaround is
all it is. It *isn’t* a legitimately good use case for bytes.pm.

Yes, workarounds are needed until the issue is fixed.

And probably until all the CPAN code that misuses UTF-8 is fixed. I found
several examples (some of them are never going to be fixed, because the
authors refuse to do so).

https://rt.cpan.org/Public/Bug/Display.html?id=87863
https://rt.cpan.org/Public/Bug/Display.html?id=87807
https://rt.cpan.org/Public/Bug/Display.html?id=30271
akarelas/xml-myxml#2

I have no idea what the concept of assertions or that of unit tests
has to do with the internal representation of strings

OK, to cut a long story short, the line is
...

Exactly. Your explanation is correct.

This is not a bug, though it certainly is suboptimal.

Of course I agree that this is a feature, not a bug.
The point was that it's suboptimal.
That is why I need to check it in assertions and unit tests ("unit
tests" is the opposite of what bytes.pm says: "use only for debugging
purposes").

encoding::warnings

It seems a great module, or at least a great idea. However, in my case it
either does not work or I misunderstood its usage (it does not catch the
error and actually silently fixes the "Unicode bug" with filenames in perl,
i.e. with this pragma the program behaves differently).

=======================
use Encode;
use utf8;
use strict;
use warnings;
use bytes ();   # load bytes.pm so bytes::length is defined; no lexical effect
my $u = "\x{442}\x{435}\x{441}\x{442}"; # same as "тест"
my $bin = "\xf1\xf2\xf3";
my $ascii = "x";
my ($ascii_u, undef) = split(/ /, "$ascii $u");

print "original bin length:\t";
print length($bin) . "\t" . bytes::length($bin) ."\n";

my $bin_a = do {
  use encoding::warnings 'FATAL';
  $bin.$ascii;
};

print "bin_a length:\t";
print length($bin_a) . "\t" . bytes::length($bin_a) ."\n";

my $bin_u = $bin.$ascii_u; # THIS LINE CONTAINS A BUG

die unless $bin_u eq $bin_a;
print "bin_u and bin_a are same!\n";

use Devel::Peek;
Dump $bin_a; Dump $bin_u;

open my $f, ">", "$bin_u.tmp";
binmode $f;
syswrite $f, "TEST";
close $f;

open $f, "<", "$bin_a.tmp" or die "file not found $!";

because if you try to catch it manually, you will miss places where
you would need to put checks
Also, if you *already know* (some of) the places

That's the idea of unit tests: catch bugs in known places.

utf8::downgrade($bin, 1);
utf8::downgrade($ascii_u, 1);
my $bin_u = $bin.$ascii_u; # THIS LINE NO LONGER CONTAINS A BUG

Yes, this is a fix for the bug. Now I need to unit test the fix with
bytes::length or encoding::warnings (i.e. the practice of writing tests
after a bug is found).

This is exactly the same bug as in your first comment on this issue

Yes, I reposted a slightly reworked example when answering another comment.

p5pRT commented Aug 16, 2013

From victor@vsespb.ru

UPD:
encoding::warnings is indeed broken (the bug was opened 5 years ago):
https://rt.cpan.org/Public/Bug/Display.html?id=33989
and cannot be fixed.

p5pRT commented Mar 6, 2014

From @khwilliamson

Changing the ticket name, as the original one has been fixed.

Karl Williamson
