What to do about 'use bytes' #12882
From @Hugmeir
Created by @Hugmeir
$_ = "\x{30cb}"; ucfirst & lcfirst return a UTF-8 flagged scalar, while the first three Perl Info
From @ap
* Brian Fraser <perlbug-followup@perl.org> [2013-03-26 01:50]:
Is it worth fixing something to follow a semantic that itself is broken I’m not sure if we had an explicit consensus about bytes.pm being highly Regards, |
The RT System itself - Status changed from 'new' to 'open' |
From @ikegami
On Mon, Mar 25, 2013 at 9:50 PM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
"and use of this module for anything other than debugging purposes is
"This module is deprecated under perl 5.18. It uses a mechanism provided by At least publicly, it's not quite the same level. but I would be happy if we could move it in that direction; and the
Indeed. If you have to deal with a buggy module, should be using |
From @iabyn
On Tue, Mar 26, 2013 at 02:50:34AM +0100, Aristotle Pagaltzis wrote:
From the top of the pod in bytes.pm, added for 5.12.0: =head1 NOTICE This pragma reflects early attempts to incorporate Unicode into perl and -- |
From @khwilliamson
On 03/25/2013 11:45 PM, Eric Brine wrote:
Attached is a patch that fixes the original report. The code it changes commit d54190f lcfirst/ucfirst plus an 8 bit locale could mangle UTF-8 values I was the one who added the comments much later. I was trying to make I'm tempted to apply the patch unless someone can say why it would break |
From @khwilliamson
0036-XXX-Patch-for-discussion-perl-117355-lu-cfirst-don-t.patch
From adaf7fd4f5bfc1575a0c9fc185002d9d6d5ca11a Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 4 May 2013 20:42:48 -0600
Subject: [PATCH 36/36] XXX Patch for discussion: [perl #117355] [lu]cfirst
don't respect 'use bytes'
---
pp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/pp.c b/pp.c
index 865438b..d6e18e0 100644
--- a/pp.c
+++ b/pp.c
@@ -3655,7 +3655,7 @@ PP(pp_ucfirst)
/* In a "use bytes" we don't treat the source as UTF-8, but, still want
* the destination to retain that flag */
- if (SvUTF8(source))
+ if (SvUTF8(source) && ! IN_BYTES)
SvUTF8_on(dest);
if (!inplace) { /* Finish the rest of the string, unchanged */
--
1.8.1.3
From @tonycoz
On Sat May 04 20:03:46 2013, public@khwilliamson.com wrote:
As much as I despise use bytes, I think this patch could go in, but it If no-one else provides tests I'll write some over the next few days. Or not, if people object to the change, in which case they should Tony |
From @tonycoz
On Thu Jul 04 18:40:10 2013, tonyc wrote:
Attached, some very basic tests, bytes.pm doesn't deserve much more :) Tony |
From @tonycoz
0001-perl-117355-very-basic-tests-for-ul-c-first-under-us.patch
From 40d56eaa790175e9c2b803c17bc5c38e03dc5ddc Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Mon, 15 Jul 2013 16:06:46 +1000
Subject: [PATCH 1/3] [perl #117355] very basic tests for [ul]c(first)? under
use bytes
the [lu]cfirst tests are TODO due to #117355
---
lib/bytes.t | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/lib/bytes.t b/lib/bytes.t
index c1ea9ea..6ac18df 100644
--- a/lib/bytes.t
+++ b/lib/bytes.t
@@ -5,7 +5,7 @@ BEGIN {
require './test.pl';
}
-plan tests => 20;
+plan tests => 24;
my $a = chr(0x100);
@@ -28,6 +28,8 @@ is(bytes::chr(0x100), chr(0), "bytes::chr sanity check");
}
my $c = chr(0x100);
+my $c2 = chr(0x2c7); # a unicode character that doesn't fold
+utf8::encode(my $c2_utf8 = $c2);
{
use bytes;
@@ -56,6 +58,13 @@ my $c = chr(0x100);
is(bytes::rindex($c, "\xc4"), 0, "bytes::rindex under use bytes looks at bytes");
}
+ # [perl #117355] [lu]cfirst don't respect 'use bytes'
+ # and if there's other tests for lc/uc under bytes I didn't find them
+ is(lc($c2), $c2_utf8, "lc under use bytes returns bytes");
+ is(uc($c2), $c2_utf8, "uc under use bytes returns bytes");
+ local $TODO = "[perl #117355] [lu]cfirst don't respect 'use bytes'";
+ is(lcfirst($c2), $c2_utf8, "lcfirst under use bytes returns bytes");
+ is(ucfirst($c2), $c2_utf8, "unfirst under use bytes returns bytes");
}
{
--
1.7.10.4
From @cpansprout
On Tue Mar 26 03:50:37 2013, davem wrote:
What can we do to upgrade this to a deprecation? -- Father Chrysostomos |
From @rjbs
On Sun Jul 14 23:54:35 2013, sprout wrote:
I'm not sure. The question is: do we propose to allow bytes.pm to become an external library? Can we do I haven't given this a lot of thought, but I think that if we can make bytes.pm ejectable, we Thoughts? Objections? -- |
From @tonycoz
On Sun Jul 14 23:22:33 2013, tonyc wrote:
Applied as ac99361, I'll leave the ticket open for the deprecation discussion, though Tony |
From @cpansprout
On Sun Aug 11 19:56:53 2013, rjbs wrote:
How much of it would we reimplement? If we want to keep its current behaviour, we would end up having to Are you suggesting just a subset of the behaviour? -- Father Chrysostomos |
From @cpansprout
On Sun Aug 11 23:22:25 2013, sprout wrote:
Just deprecating bytes.pm outright, not just deprecating it from core, In 5.20 and 5.22 it warns ‘Use of bytes.pm is deprecated’. -- Father Chrysostomos |
From victor@vsespb.ru
On Sun Aug 11 19:56:53 2013, rjbs wrote:
A possible use of bytes is to speed up regexps by 20-40% in some cases: use warnings; sub test sub try_drop_utf8_flag my $ascii = "x"; die unless $ascii_u eq $ascii; cmpthese(-1,{ __END__ perl-5.18.0 Rate ascii with utf8 on ascii with utf8 perl-5.10.0 Rate ascii with utf8 on ascii with utf8 |
From victor@vsespb.ru
On Sun Aug 11 19:56:53 2013, rjbs wrote:
Another possible use of bytes are: 1) run-time, production-enabled assertions ( 2) Unit tests (sometimes performance matters). Below example contains a bug (from Perl point view this can be treated (bug marked with "# THIS LINE CONTAINS A BUG") It does not affect anything, even program output, except bin_u is 7 bytes length, and bin_a is 4 bytes length. if 7 vs 4 bytes looks unimportant, consider 700 vs 400 MiB of binary files. And this bug can be caught (runtime or in unit tests) if line "#die if The only possible way to catch this is a use of bytes::length (or ===== use Encode; sub is_wide_string my $u = "\x{442}\x{435}\x{441}\x{442}"; # same as "тест" # plain binary data, for example part of binary file (say, JPEG) my $ascii = "x"; print "original bin length:\t"; my $bin_a = $bin.$ascii; print "bin_a length:\t"; my $bin_u = $bin.$ascii_u; # THIS LINE CONTAINS A BUG #die if is_wide_string($bin_u); print "bin_u length:\t"; open my $f, ">", "file_a.tmp"; open $f, ">", "file_u.tmp"; system("md5sum file_?.tmp"); __END__ |
From victor@vsespb.ru
On Sun Aug 11 19:56:53 2013, rjbs wrote:
Another edge case where Perl's Unicode handling does not work well is filenames. Code below prints that two strings are same. Tries to open file with So that is another case when user might want to use bytes::xxx, (btw, I have a program which have to deal with non-UTF filesystems, this ======== use Encode; my $u = "\x{442}\x{435}\x{441}\x{442}"; # same as "тест" # plain binary data, for example part of binary file (say, JPEG) my $ascii = "x"; print "original bin length:\t"; my $bin_a = $bin.$ascii; print "bin_a length:\t"; my $bin_u = $bin.$ascii_u; # THIS LINE CONTAINS A BUG die unless $bin_u eq $bin_a; open my $f, ">", "$bin_u.tmp"; open $f, "<", "$bin_a.tmp" or die "file not found $!"; |
From @ikegami
On Mon, Aug 12, 2013 at 4:08 PM, Victor Efimov via RT <
That's just C<< utf8::downgrade($_[0], 1) >> |
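For readers following along, here is a minimal sketch of the suggestion above. It assumes the helper name `try_drop_utf8_flag` from the quoted benchmark (its original body is not shown in this thread, so this body is a guess at the intent), and the sample strings are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helper, named after the one in the quoted benchmark.
# Tries to drop the internal UTF8 flag in place; returns true on success.
sub try_drop_utf8_flag {
    # The second argument (FAIL_OK) makes utf8::downgrade return false
    # instead of dying when the string holds a code point above 0xFF.
    return utf8::downgrade($_[0], 1);
}

my $ascii_u = "x";
utf8::upgrade($ascii_u);    # same character, UTF8 flag now set
my $ok_ascii = try_drop_utf8_flag($ascii_u);

my $wide    = "\x{442}\x{435}\x{441}\x{442}";    # cannot be downgraded
my $ok_wide = try_drop_utf8_flag($wide);

printf "ascii: downgraded=%d still_flagged=%d\n",
    $ok_ascii ? 1 : 0, utf8::is_utf8($ascii_u) ? 1 : 0;
printf "wide:  downgraded=%d still_flagged=%d\n",
    $ok_wide ? 1 : 0, utf8::is_utf8($wide) ? 1 : 0;
```

Because `$_[0]` aliases the caller's variable, the flag is dropped on the original scalar rather than on a copy, and no `use bytes` is involved.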
From @Hugmeir
On Mon, Aug 12, 2013 at 5:08 PM, Victor Efimov via RT <
I've tweaked your benchmark a bit: my $ascii = ("x" x 400) . "z"; utf8::encode($bytes); die unless $ascii_u eq $ascii; cmpthese(timethese(10000,{ Which gives me, on 5.14.2: On 5.16.2: (skipping 5.18 because I compiled it with different options than everything So, your argument about the speedup might hold some water on older perls, |
From victor@vsespb.ru
On Mon Aug 12 13:57:25 2013, ikegami@adaelis.com wrote:
Yes, you are right, except one small difference. (That's btw exactly like I work with it in my program use bytes (); die unless $s1 eq $s2; open my $f, ">", "$s1.tmp"; open $f, "<", "$s2.tmp" or die "file not found $!"; __END__ |
From victor@vsespb.ru
Sorry, RT corrupted this character in code (even RT doesn't like Latin1
On Mon Aug 12 14:17:19 2013, vsespb wrote:
From victor@vsespb.ru
On Mon Aug 12 14:02:02 2013, Hugmeir wrote:
Well, utf8::encode($bytes) will change the string. So if a) I have ASCII regexp utf8::encode($bytes) won't work as needed. It will damage string if it's
Yes, indeed, 5.18 is still slow, but blead is already fast. |
From @ikegami
On Mon, Aug 12, 2013 at 5:17 PM, Victor Efimov via RT <
I see, but it's still not a reason to keep bytes. It simply means we need |
From @ap
* Victor Efimov via RT <perlbug-followup@perl.org> [2013-08-12 23:20]:
That is really the last remnant (I think) of The Unicode Bug. The Your use of bytes.pm is a workaround. And however useful it may be while * Victor Efimov via RT <perlbug-followup@perl.org> [2013-08-12 22:40]:
I have no idea what the concept of assertions or that of unit tests has
OK, to cut a long story short, the line is my $bin_u = $bin.$ascii_u; # THIS LINE CONTAINS A BUG in which $ascii_u has the UTF8 flag set even though it contains only Your complaint is that the concatenation blindly produces a string with This is not a bug, though it certainly is suboptimal. In theory Perl string operations could try to produce downgraded strings
No. The only possible way to catch this is encoding::warnings (with FATAL), Also, if you *already know* (some of) the places in your program where utf8::downgrade($bin, 1); So here too, I do not see bytes.pm being useful in any real way. In fact * Victor Efimov via RT <perlbug-followup@perl.org> [2013-08-12 22:55]:
This is exactly the same bug as in your first comment on this issue, i.e. Regards, |
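The concatenation behaviour discussed in this message can be sketched as follows. This is an illustration under assumptions: the three-byte value of `$bin` is made up (the original example's data is not shown), and `bytes::length` is used only as a debugging aid to show the internal storage size.

```perl
use strict;
use warnings;
use bytes ();    # loads bytes::length without enabling byte semantics

my $bin     = "\xFF\xFE\xFD";    # hypothetical 3 bytes of binary data
my $ascii   = "x";
my $ascii_u = "x";
utf8::upgrade($ascii_u);         # same text, UTF8 flag set

my $bin_a = $bin . $ascii;       # stays in the byte representation
my $bin_u = $bin . $ascii_u;     # whole result is upgraded internally

# Both strings hold the same four characters.
my $same    = $bin_u eq $bin_a ? 1 : 0;
my $chars_a = length $bin_a;
my $chars_u = length $bin_u;

# Only the internal storage differs; bytes::length exposes it.
my $stored_a = bytes::length($bin_a);
my $stored_u = bytes::length($bin_u);

# Downgrading a copy restores the compact form. Without FAIL_OK this
# would die if the string held a code point above 0xFF.
my $fixed = $bin_u;
utf8::downgrade($fixed);
my $stored_fixed = bytes::length($fixed);

print "equal: $same, chars: $chars_a vs $chars_u\n";
print "stored: $stored_a vs $stored_u, after downgrade: $stored_fixed\n";
```

The two strings compare equal and have the same character length; only the stored size differs (4 versus 7 bytes for this data), which matches the 4 versus 7 figure in the earlier comment.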
From victor@vsespb.ru
On Thu Aug 15 15:46:39 2013, aristotle wrote:
More precisely http://perldoc.perl.org/perlunicode.html "When Unicode
Yes, workarounds are needed until the issue is fixed. And probably until all CPAN code which misuses UTF-8 is fixed. I found https://rt.cpan.org/Public/Bug/Display.html?id=87863
Exactly. Your explanation is correct.
Of course I agree that this is a feature, not a bug.
Seems a great module. At least great idea. However for my case it does ======================= print "original bin length:\t"; my $bin_a = do { print "bin_a length:\t"; my $bin_u = $bin.$ascii_u; # THIS LINE CONTAINS A BUG die unless $bin_u eq $bin_a; use Devel::Peek; open my $f, ">", "$bin_u.tmp"; open $f, "<", "$bin_a.tmp" or die "file not found $!";
That's the idea of unit tests: catch bugs in known places.
Yes. This is a fix for the bug. Now I need to unit test the fix with
Yes, I reposted a slightly reworked example when answering another comment. |
From victor@vsespb.ru
UPD: On Thu Aug 15 17:45:08 2013, vsespb wrote:
From @khwilliamson
Changing the ticket name as the original one has been fixed, Karl Williamson |
Migrated from rt.perl.org#117355 (status was 'open')
Searchable as RT117355$