Report information
Id: 130907
Status: resolved
Priority: 0/
Queue: perl5

Owner: arc <arc [at] cpan.org>
Requestors: mauke- <l.mai [at] web.de>
Cc:
AdminCc:

Operating System: Linux
PatchStatus: (no value)
Severity: low
Type: core
Perl Version: 5.24.1
Fixed In: (no value)

Attachments
0001-RT-130907-Fix-the-Unicode-Bug-in-split.patch



Date: Fri, 03 Mar 2017 16:23:35 +0100
To: perlbug [...] perl.org
Subject: unicode_strings feature doesn't work with split ' '
From: l.mai [...] web.de
This is a bug report for perl from l.mai@web.de,
generated with the help of perlbug 1.40 running under perl 5.24.1.

-----------------------------------------------------------------
[Please describe your issue here]

$ cat bug.pl
#!perl
use strict;
use warnings;

use utf8;
use feature qw(unicode_strings);

print "case 1: $_\n" for split ' ', "A B\x{A0}C";

print "case 2: $_\n" for split ' ', "A B\x{A0}C€";
__END__

$ perl bug.pl
case 1: A
case 1: B C
case 2: A
case 2: B
case 2: C€

In case 1 the \x{A0} (no-break space) is not treated as whitespace, so we get 2
elements: A and B<nbsp>C.

In case 2 the \x{A0} is treated as whitespace and we get 3 elements: A, B and
C<euro>.

I thought feature 'unicode_strings' would make these behave the same,
regardless of whether a string literal contains a >255 character or not.

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=low
---
Site configuration information for perl 5.24.1:

Configured by mauke at Sun Feb 19 23:06:44 CET 2017.

Summary of my perl5 (revision 5 version 24 subversion 1) configuration:

  Platform:
    osname=linux, osvers=4.9.6-1-arch, archname=i686-linux
    uname='linux simplicio 4.9.6-1-arch #1 smp preempt thu jan 26 09:41:20 cet 2017 i686 gnulinux '
    config_args=''
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -flto',
    cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion='', gccversion='6.3.1 20170109', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234, doublekind=3
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12, longdblkind=3
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/i686-pc-linux-gnu/6.3.1/include-fixed /usr/lib /lib
    libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.24.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.24'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -flto -L/usr/local/lib -fstack-protector-strong'

---
@INC for perl 5.24.1:
    /home/mauke/usr/lib/perl5/site_perl/5.24.1/i686-linux
    /home/mauke/usr/lib/perl5/site_perl/5.24.1
    /home/mauke/usr/lib/perl5/5.24.1/i686-linux
    /home/mauke/usr/lib/perl5/5.24.1

---
Environment for perl 5.24.1:
    HOME=/home/mauke
    LANG=en_US.UTF-8
    LANGUAGE=en_US
    LC_COLLATE=C
    LC_MONETARY=de_DE.UTF-8
    LC_TIME=de_DE.UTF-8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/mauke/perl5/perlbrew/bin:/home/mauke/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl
    PERLBREW_BASHRC_VERSION=0.73
    PERLBREW_HOME=/home/mauke/.perlbrew
    PERLBREW_ROOT=/home/mauke/perl5/perlbrew
    PERL_BADLANG (unset)
    PERL_UNICODE=SAL
    SHELL=/bin/bash
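The dependence on internal storage described in the report can be shown without a >255 character at all. The following is a minimal sketch, not part of the original report: it builds the same string twice and uses the core function `utf8::upgrade` to switch one copy to UTF-8 internal storage. The variable names are illustrative only.

```perl
#!perl
use strict;
use warnings;
use feature qw(unicode_strings);

# Two copies of the same string "A B<nbsp>C".
my $native   = "A B\x{A0}C";   # all characters < 256: single-byte storage
my $upgraded = "A B\x{A0}C";
utf8::upgrade($upgraded);      # same characters, UTF-8 internal storage

# The strings compare equal; only the internal representation differs.
print "equal: ", ($native eq $upgraded ? "yes" : "no"), "\n";
print "native is_utf8:   ", (utf8::is_utf8($native)   ? 1 : 0), "\n";
print "upgraded is_utf8: ", (utf8::is_utf8($upgraded) ? 1 : 0), "\n";

# Count the fields that split ' ' produces for each copy.
my @native_fields   = split ' ', $native;
my @upgraded_fields = split ' ', $upgraded;

# On a perl with the bug (such as 5.24.1) this prints 2 and 3;
# on a perl where the bug is fixed both counts are 3.
print "native fields:   ", scalar(@native_fields),   "\n";
print "upgraded fields: ", scalar(@upgraded_fields), "\n";
```

If the two counts differ, the perl in use treats the no-break space differently depending on how the string happens to be stored, which is the behaviour the report describes.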
RT-Send-CC: perl5-porters [...] perl.org
It appears that the output is affected by whether or not 'use utf8;' appears in the file.

#####
$ cat 130907-no-use-utf8-unicode_strings.pl
#!/usr/bin/env perl
use strict;
use warnings;

#use utf8;
use feature 'unicode_strings';
print "case 1: $_\n" for split ' ', "A B\x{A0}C";
print "case 2: $_\n" for split ' ', "A B\x{A0}C€";

$ perl 130907-no-use-utf8-unicode_strings.pl
case 1: A
case 1: B�C
case 2: A
case 2: B�C€
#####
$ cat 130907-use-utf8-unicode_strings.pl
#!/usr/bin/env perl
use strict;
use warnings;

use utf8;
use feature 'unicode_strings';
print "case 1: $_\n" for split ' ', "A B\x{A0}C";
print "case 2: $_\n" for split ' ', "A B\x{A0}C€";

[p5p] 512 $ perl 130907-use-utf8-unicode_strings.pl
case 1: A
case 1: B�C
case 2: A
case 2: B
Wide character in print at 130907-use-utf8-unicode_strings.pl line 8.
case 2: C€
#####

-- 
James E Keenan (jkeenan@cpan.org)
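The "Wide character in print" warning in the second run is what perl emits when a character above 255 is printed to a filehandle that has no encoding layer. The original report's environment had PERL_UNICODE=SAL, which applies UTF-8 layers to the standard handles, so that is a likely reason the warning did not appear there. A minimal sketch, assuming UTF-8 output is wanted, that sets the layer explicitly instead of relying on the environment:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                        # string literals in this file are decoded as UTF-8
use feature 'unicode_strings';

# Encode characters as UTF-8 on output rather than depending on PERL_UNICODE.
binmode(STDOUT, ':encoding(UTF-8)');

# The euro sign forces UTF-8 internal storage for the whole string,
# so the no-break space is treated as whitespace by split ' '.
my @fields = split ' ', "A B\x{A0}C€";
print "case 2: $_\n" for @fields;
```

This only affects how the result is printed. It does not change how `split ' '` treats the no-break space in case 1.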
To: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #130907] unicode_strings feature doesn't work with split ' '
CC: "bugs-bitbucket [...] rt.perl.org" <bugs-bitbucket [...] rt.perl.org>
Date: Sat, 4 Mar 2017 13:05:25 +0000
From: Aaron Crane <arc [...] cpan.org>
l.mai@web.de <perlbug-followup@perl.org> wrote:
> I thought feature 'unicode_strings' would make these behave the same,
> regardless of whether a string literal contains a >255 character or not.
I agree; this is another instance of the Unicode Bug.

I've attached a patch to fix this, but given the current depth of the
freeze, I'll aim to merge it for 5.27.1.

-- 
Aaron Crane ** http://aaroncrane.co.uk/
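For code that has to run on a perl without the patch, one possible workaround is to force UTF-8 internal storage before splitting. This is a sketch under that assumption and is not taken from the attached patch; `split_ws_unicode` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: split on whitespace using Unicode rules,
# regardless of how the argument is stored internally.
sub split_ws_unicode {
    my ($str) = @_;          # work on a copy so the caller's string is untouched
    utf8::upgrade($str);     # switch to UTF-8 storage; the characters are unchanged
    return split ' ', $str;
}

my @fields = split_ws_unicode("A B\x{A0}C");
print scalar(@fields), " fields: @fields\n";
```

Because the copy is always upgraded, the no-break space is handled the same way as in case 2 of the report, whether or not the input contains a character above 255.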

Message body is not shown because sender requested not to inline it.

RT-Send-CC: perl5-porters [...] perl.org
On Sat, 04 Mar 2017 13:06:16 GMT, arc wrote:
> I agree; this is another instance of the Unicode Bug.
>
> I've attached a patch to fix this, but given the current depth of the
> freeze, I'll aim to merge it for 5.27.1.
Which gives us plenty of time to smoke test this ... so I have created this branch:

smoke-me/jkeenan/arc/130907-unicode-bug-in-split

-- 
James E Keenan (jkeenan@cpan.org)
RT-Send-CC: perl5-porters [...] perl.org
On Sat, 04 Mar 2017 14:00:38 GMT, jkeenan wrote:
> Which gives us plenty of time to smoke test this ... so I have created
> this branch:
>
> smoke-me/jkeenan/arc/130907-unicode-bug-in-split
Aaron, with perl-5.26.0 released, would you like to take the discussion of this ticket forward?

Smoke test results here:

http://perl.develop-help.com/?b=smoke-me%2Fjkeenan%2Farc%2F130907-unicode-bug-in-split

-- 
James E Keenan (jkeenan@cpan.org)
RT-Send-CC: perl5-porters [...] perl.org
The (minority of) failing reports from that smoke-me branch seem to be unrelated to this change. I've therefore pushed a rebased version of my patch as 20ae58f7a9bbf84d043d6e90f5988b6e3ca4ee3d

-- 
Aaron Crane ** http://aaroncrane.co.uk/

