B::Deparse fails at UTF-8 in regexes #13343

Closed
p5pRT opened this issue Oct 10, 2013 · 10 comments

p5pRT commented Oct 10, 2013

Migrated from rt.perl.org#120182 (status was 'resolved')

Searchable as RT120182$

p5pRT commented Oct 10, 2013

From @mauke

Created by @mauke

% perl -MO=Deparse -e 'use utf8; /€/'
use utf8;
/\342\202\254/;
-e syntax OK

Expected:
use utf8;
/\x{20AC}/;
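
For readers comparing the two outputs: the octal escapes in the actual output are the three UTF-8 bytes of U+20AC, while the expected output names the code point itself. A minimal sketch of that relationship, using the core Encode module (the variable names are illustrative only and have nothing to do with B::Deparse internals):

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+20AC EURO SIGN as a single character.
my $char = "\x{20AC}";

# The same character as a UTF-8 encoded byte string (three bytes).
my $bytes = encode('UTF-8', $char);

# Byte-level view: one three-digit octal escape per byte.
my $octal = join '', map { sprintf('\\%03o', ord($_)) } split //, $bytes;

# Character-level view: one \x{...} escape for the code point.
my $hex = sprintf('\\x{%04X}', ord($char));

print "$octal\n";
print "$hex\n";
```

The first line printed is the byte-level form that Deparse emitted, the second is the character-level form the report asks for.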

Perl Info

Flags:
    category=library
    severity=low
    module=B::Deparse

This perlbug was built using Perl 5.12.1 - Thu Jun  3 20:09:15 CEST 2010
It is being executed now by  Perl 5.18.1 - Tue Aug 13 07:08:47 CEST 2013.

Site configuration information for perl 5.18.1:

Configured by mauke at Tue Aug 13 07:08:47 CEST 2013.

Summary of my perl5 (revision 5 version 18 subversion 1) configuration:
   
  Platform:
    osname=linux, osvers=3.5.7-gentoo, archname=i686-linux
    uname='linux nora 3.5.7-gentoo #5 preempt sat jan 26 16:46:10 cet 2013 i686 amd athlon(tm) 64 processor 3200+ authenticamd gnulinux '
    config_args=''
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -flto',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.8.1', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags ='-O2 -flto -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib/../lib /usr/lib/../lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.15.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.15'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -flto -L/usr/local/lib -fstack-protector'

Locally applied patches:
    SAVEARGV0 - disable magic open in <ARGV>


@INC for perl 5.18.1:
    /home/mauke/usr/local/lib/perl5/site_perl/5.18.1/i686-linux
    /home/mauke/usr/local/lib/perl5/site_perl/5.18.1
    /home/mauke/usr/local/lib/perl5/5.18.1/i686-linux
    /home/mauke/usr/local/lib/perl5/5.18.1
    .


Environment for perl 5.18.1:
    HOME=/home/mauke
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LC_COLLATE=POSIX
    LD_LIBRARY_PATH=/home/mauke/usr/local/lib
    LOGDIR (unset)
    PATH=/home/mauke/usr/perlbrew/bin:/home/mauke/usr/local/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/i686-pc-linux-gnu/gcc-bin/4.6.3:/opt/sun-jdk-1.4.2.13/bin:/opt/sun-jdk-1.4.2.13/jre/bin:/opt/sun-jdk-1.4.2.13/jre/javaws:/opt/dmd/bin:/usr/games/bin
    PERLBREW_BASHRC_VERSION=0.43
    PERLBREW_HOME=/home/mauke/.perlbrew
    PERLBREW_PATH=/home/mauke/usr/perlbrew/bin
    PERLBREW_ROOT=/home/mauke/usr/perlbrew
    PERLBREW_VERSION=0.27
    PERL_BADLANG (unset)
    PERL_UNICODE=SAL
    SHELL=/bin/bash

p5pRT commented Oct 23, 2013

From @iabyn

On Thu, Oct 10, 2013 at 01:46:59PM -0700, l.mai@web.de wrote:

> % perl -MO=Deparse -e 'use utf8; /€/'
> use utf8;
> /\342\202\254/;
> -e syntax OK
>
> Expected:
> use utf8;
> /\x{20AC}/;

Fixed with fea7fb2.

I also noticed while looking into this, that Deparse (mostly) outputs
chars > 255 as-is rather than converting to \x{} format, e.g.:

  $ p -MO=Deparse -e'$x="\x{100}"'
  Wide character in print at lib/B/Deparse.pm line 1355.
  $x = 'Ā';
  -e syntax OK

This is because shortly after 5.6.0, the 'does this char need escaping'
test was changed from an explicit range to /[[:print:]]/. Then around
5.8.0, the meaning of [[:print:]] changed to include chars > 255 as
printable. So I think the original intent of the code was to display big
chars in \x{} form. I could revert to the old behaviour by changing all
occurrences of [[:print:]] to [^\p{PosixPrint}].
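
The difference between the two character classes can be sketched as follows. This is only an illustration under stated assumptions: `escape_non_posix_print` is a hypothetical helper, not the code in B::Deparse, and it escapes everything outside printable ASCII (including newlines and tabs) for simplicity.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not the code in B::Deparse.
# Replaces every character outside printable ASCII with a \x{...} escape.
sub escape_non_posix_print {
    my ($str) = @_;
    $str =~ s/(\P{PosixPrint})/sprintf('\\x{%X}', ord($1))/ge;
    return $str;
}

my $wide = "\x{100}";

# Under Unicode rules [[:print:]] accepts U+0100, so an escaping test
# based on it leaves the character as-is; \p{PosixPrint} is ASCII-only.
my $print_matches       = $wide =~ /[[:print:]]/    ? 1 : 0;
my $posix_print_matches = $wide =~ /\p{PosixPrint}/ ? 1 : 0;

print "[[:print:]] matches U+0100: $print_matches\n";
print "\\p{PosixPrint} matches U+0100: $posix_print_matches\n";
print escape_non_posix_print("a${wide}b"), "\n";
```

With an ASCII-only test the output stays printable on any terminal, which would also avoid the "Wide character in print" warning shown above.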

On the other hand, this has now been long-standing behaviour, and no-one
seems to have complained much. What do people think?

The other thing that occurs to me is that perhaps Deparse should be clever
enough that if it is within the scope of 'use utf8' then it outputs big
literal chars as a series of utf8 bytes, and as a \x{} escape otherwise?

--
Standards (n). Battle insignia or tribal totems.

p5pRT commented Oct 23, 2013

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Oct 23, 2013

From zefram@fysh.org

Dave Mitchell wrote:

> On the other hand, this has now been long-standing behaviour, and no-one
> seems to have complained much. What do people think?

We're better off with the deparser generating pure ASCII output.

> if it is within the scope of 'use utf8' then it outputs big
> literal chars as a series of utf8 bytes, and as a \x{} escape otherwise?

If only "use utf8" had such a singular purpose. Actually, if "use utf8"
only meant "this source is in UTF-8" then we wouldn't need to store it as
a hint bit in COPs, and so we wouldn't have the hint available to control
the deparser's output format. And even if a non-overloaded "this source
is in UTF-8" flag were available, that's still not the same thing as
"all Unicode characters are acceptable to the author of this source".
And even that's not the same as "non-ASCII characters are acceptable to
the consumer of this deparse".

-zefram

p5pRT commented Oct 23, 2013

From @Tux

On Wed, 23 Oct 2013 15:35:19 +0100, Zefram <zefram@fysh.org> wrote:

> Dave Mitchell wrote:
>
> > On the other hand, this has now been long-standing behaviour, and no-one
> > seems to have complained much. What do people think?
>
> We're better off with the deparser generating pure ASCII output.

I agree, and even if most of my terminal sessions are fully UTF-8
aware, I would still prefer "\x{20ac}" over "€" in Deparse

<dreaming>Deparse could get optional featured behavior to instead
  spew "\N{EURO SIGN}"</dreaming>

> > if it is within the scope of 'use utf8' then it outputs big
> > literal chars as a series of utf8 bytes, and as a \x{} escape otherwise?
>
> If only "use utf8" had such a singular purpose. Actually, if "use utf8"
> only meant "this source is in UTF-8" then we wouldn't need to store it as
> a hint bit in COPs, and so we wouldn't have the hint available to control
> the deparser's output format. And even if a non-overloaded "this source
> is in UTF-8" flag were available, that's still not the same thing as
> "all Unicode characters are acceptable to the author of this source".
> And even that's not the same as "non-ASCII characters are acceptable to
> the consumer of this deparse".

--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.19 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented Oct 23, 2013

From @ikegami

On Wed, Oct 23, 2013 at 10:25 AM, Dave Mitchell <davem@iabyn.com> wrote:

> On the other hand, this has now been long-standing behaviour, and no-one
> seems to have complained much. What do people think?

I figured it was that way for backwards compatibility, so I always use
Useqq=1.
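
For context, a minimal sketch of what `Useqq=1` does, assuming the comment refers to Data::Dumper's `$Data::Dumper::Useqq` setting (the thread does not name the module explicitly). With it enabled, Data::Dumper emits double-quoted strings with non-ASCII characters written as escapes rather than literally:

```perl
use strict;
use warnings;
use Data::Dumper;

# Assumption: "Useqq=1" means this Data::Dumper configuration variable.
$Data::Dumper::Useqq = 1;   # double-quoted output with escapes
$Data::Dumper::Terse = 1;   # omit the "$VAR1 = " prefix

# Dump a string holding U+20AC; the output is pure ASCII.
my $out = Dumper("\x{20AC}");
print $out;
```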

p5pRT commented Oct 23, 2013

From @ikegami

On Wed, Oct 23, 2013 at 10:35 AM, Zefram <zefram@fysh.org> wrote:

> Actually, if "use utf8"
> only meant "this source is in UTF-8" then we wouldn't need to store it as
> a hint bit in COPs, and ...

What else does it mean?

p5pRT commented Oct 23, 2013

From zefram@fysh.org

Eric Brine wrote:

> What else does it mean?

It's a lexical pragma visible to subroutines via caller(). Formerly the
pragma had direct effects on some core ops, for example:

$ perl5.6.2 -lwe '$a="\x{cc}"; ($b)=($a=~/(.*)/s); ($c)=do{ use utf8; ($a=~/(.*)/s) }; print length for $b, $c'
2
1

so code that tries to be portable between core versions may "use utf8"
for this effect on older perls. The use of the pragma can be made
conditional as "use if $] < 5.008, 'utf8'", but until now there hasn't
been a pressing reason to write it that way.

"use utf8" can also be used to ensure that the subroutines utf8::upgrade()
et al are available. They're actually available regardless of the pragma,
on those perls that provide them at all, but initially the documentation
wasn't clear about that, so treating it as a module to be loaded has its
attractions. I've written "use utf8 ()" for this purpose, not invoking
the pragma, but hardly anyone uses "()" this way to suppress importation.

-zefram

p5pRT commented Nov 11, 2016

From @mauke

On Wed, 23 Oct 2013 07:27:12 -0700, davem wrote:

> On Thu, Oct 10, 2013 at 01:46:59PM -0700, l.mai@web.de wrote:
>
> > % perl -MO=Deparse -e 'use utf8; /€/'
> > use utf8;
> > /\342\202\254/;
> > -e syntax OK
> >
> > Expected:
> > use utf8;
> > /\x{20AC}/;
>
> Fixed with fea7fb2.

Closing as resolved.

p5pRT commented Nov 11, 2016

@mauke - Status changed from 'open' to 'resolved'
