RFE: use heuristic for utf8 usage w/-Mutf8 in PERL5OPT #16551

Open
p5pRT opened this issue May 7, 2018 · 3 comments

p5pRT commented May 7, 2018

Migrated from rt.perl.org#133183 (status was 'open')

Searchable as RT133183$

p5pRT commented May 7, 2018

From perl-diddler@tlinx.org

Created by perl-diddler@tlinx.org

For some time I had odd output in one of my programs
where I tried to use a right-pointing double angle quotation mark,
U+00BB (»). It always came out as "»". I had "use utf8;" in
my source, even had "use utf8::all;" in some, but most of all,
I thought I was safe with "-Mutf8 -CSA" in PERL5OPT.

Once I'd finished development on an older module, I simply
used it. If I ran the module as a program under the debugger,
it seemed to work. The problem was that I simply wanted
perl to assume modern sources should be treated as
utf8 or, at worst, to output the same bytes as on input.
bash does this:

a="»"
printf "%s\n" "$a"
»
printf "%s\n" "$a"|hexdump -C
00000000 c2 bb 0a
---

C does this:

#include <stdio.h>
int main(int argc, char *argv[]) {
  char arr[3]="»";
  printf("%s\n", arr);
}

gcc ar.c -o ar
ar
»

I can't think of any language that forces
0x80-0xff into a different encoding in source or input
than it outputs.

*Ideally*, perl wouldn't either. However, some would complain
of compatibility problems (though it didn't seem to cause the end
of the world for bash or C, as far as I'm aware).

BUT, at the very least... a compromise heuristic could
be used. A first-level heuristic would be:

1) if 0xc2 or 0xc3, followed by another byte in the range
0x80-0xff, occurs in the source, presume it is utf8 encoded.

For some, though, that would still let too much incompatibility
slip through.

To that I say, add:

2) if the ENV var PERL5OPT has -Mutf8 in it AND "1" also
holds, then assume the source is utf8. It might not be 100%
compatible, BUT it lets the local user set a presumption
for their system. If they run into a module that
doesn't work, they can work around it. Alternatively,
have perl access a site config file (I think it can be
configured to use one in /etc/?) where they can
specify the flag.

If more safety were wanted,
as an add-on step to 1 or 2:
2) or 3) put out a one-time warning with the first byte combo that
triggers utf8 encoding, on a per-module basis. That way,
the user could either silence the warning or simply
add 'use utf8' to the beginning of that module (the
latter being more logical).
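
The three steps above can be sketched as follows. This is only an illustration of the proposal, written in Python to keep the byte handling explicit; it does not describe anything perl does today. The function names (`first_utf8_pair`, `presume_utf8`, `check_module`) and the `warned` set are hypothetical. One deliberate deviation is labeled in the code: UTF-8 continuation bytes only span 0x80-0xbf, so the sketch tests that narrower range instead of 0x80-0xff.

```python
# Hypothetical sketch of the proposed heuristics; not perl's actual behaviour.
LEAD_BYTES = {0xC2, 0xC3}        # UTF-8 lead bytes for U+0080..U+00FF
CONT_MIN, CONT_MAX = 0x80, 0xBF  # assumption: valid continuation range, narrower than 0x80-0xff


def first_utf8_pair(source: bytes):
    """Rule 1: return the offset of the first lead+continuation pair, or None."""
    for i in range(len(source) - 1):
        if source[i] in LEAD_BYTES and CONT_MIN <= source[i + 1] <= CONT_MAX:
            return i
    return None


def presume_utf8(source: bytes, perl5opt: str = "") -> bool:
    """Rule 2: require both -Mutf8 in PERL5OPT and a rule 1 match."""
    return "-Mutf8" in perl5opt.split() and first_utf8_pair(source) is not None


def check_module(name: str, source: bytes, perl5opt: str, warned: set) -> bool:
    """Rule 3: warn once per module, naming the bytes that triggered the presumption."""
    if not presume_utf8(source, perl5opt):
        return False
    if name not in warned:
        warned.add(name)
        i = first_utf8_pair(source)
        print(f"{name}: presuming utf8 source (bytes {source[i:i + 2].hex()} at offset {i})")
    return True
```

Calling `check_module("Foo.pm", b'print "\xc2\xbb";', "-Mutf8 -CSA", set())` would report the pair `c2bb` once and return `True`; the same source with `PERL5OPT` lacking `-Mutf8` returns `False` without a warning.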

-----------------

Tangential, but related: Additionally, if a config file is
used, it should be possible to specify stdin/out/err as
defaulting to the locale, the assumption being that
streamed I/O is not how one would normally access binary
data. The idea is to have perl be [mostly] binary clean
with regard to streamed input & output. (I realize some want
to flag errors on invalid utf8. That is not my first choice, but
I don't see a problem with it in streamed I/O, as the
assumption is that one wouldn't use a variable-length encoding
for storing binary data.)

This might assist in putting the infamous perl utf8 bug
to rest (at least for the most part). It also introduces
the idea of trying to give or do what the user wants based on
increasing levels of evidence. Admittedly an imperfect
science, but better than using rigid standards when it comes
to humans. Perl should "just be smarter".

This isn't version related as it happens under perl 5.24.0
as well as 5.16.3.

Perl Info

Flags:
    category=core
    severity=wishlist

Site configuration information for perl 5.16.3:

Configured by law at Wed Jan 22 12:58:58 PST 2014.

Summary of my perl5 (revision 5 version 16 subversion 3) configuration:
   
  Platform:
    osname=linux, osvers=3.12.0-isht-van, archname=x86_64-linux-thread-multi-ld
    uname='linux ishtar 3.12.0-isht-van #1 smp preempt wed nov 13 16:50:51 pst 2013 x86_64 x86_64 x86_64 gnulinux '
    config_args=''
    hint=previous, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=define
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-g -O2',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    ccversion='', gccversion='4.8.1 20130909 [gcc-4_8-branch revision 202388]', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='long double', nvsize=16, Off_t='off_t', lseeksize=8
    alignbytes=16, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags ='-g -fstack-protector -fPIC'
    libpth=/usr/lib64 /lib64
    libs=-lnsl -lndbm -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=/lib/libc-2.18.so, so=so, useshrplib=true, libperl=libperl-5.16.3.so
    gnulibc_version='2.18'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/home/perl/perl-5.16.3/lib/x86_64-linux-thread-multi-ld/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -g -O2 -fstack-protector -fPIC'

Locally applied patches:
    


@INC for perl 5.16.3:
    /home/law/bin/lib
    /home/perl/perl-5.16.3/lib/site/x86_64-linux-thread-multi-ld
    /home/perl/perl-5.16.3/lib/site
    /home/perl/perl-5.16.3/lib/x86_64-linux-thread-multi-ld
    /home/perl/perl-5.16.3/lib
    .


Environment for perl 5.16.3:
    HOME=/home/law
    LANG (unset)
    LANGUAGE (unset)
    LC_COLLATE=C
    LC_CTYPE=en_US.UTF-8
    LC_MESSAGES=C
    LC_MONETARY=C
    LC_NUMERIC=C
    LC_TIME=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/perl/perl-5.24/usr/bin:.:/sbin:/home/law/bin/lib:/home/law/bin:/usr/local/bin:/usr/bin:/bin:/opt/kde3/bin:/usr/sbin:/etc/local/func_lib:/home/law/lib
    PERL5OPT=-Mutf8 -CSA -I/home/law/bin/lib
    PERL_BADLANG (unset)
    SHELL=/bin/bash

p5pRT commented May 7, 2018

From @Grinnz

A couple of things:

1. "Output the same bytes as on input." Nothing in Perl prevents this from occurring, but it's impossible to perform character-aware operations (like matching \w against Unicode word characters) without knowing what encoding to decode the input from.

2. "use utf8;" only affects the source code itself. It's very different to talk about Perl's treatment of the bytes in the source code, and Perl's treatment of input and output bytes. Other operations are required to translate UTF-8 encoding at STDIN/STDOUT/STDERR, ARGV, and opened filehandle boundaries, among other things. These three things are covered by -CSAD. See https://metacpan.org/pod/perlrun#-C-[number/list]

3. I disagree with the feasibility of any of the presented heuristics. It's 100% possible for a single-byte encoded file to look like UTF-8.

4. Using the locale to set default utf8 layers was a failed experiment in (I believe) Perl 5.8.0. You can enable this behavior for yourself with -CSADL (or adding L to your other -C switch arguments, see above link).

5. A potential way forward to at least default to the behavior of 'use utf8;' (decoding source code as UTF-8) was previously discussed in https://www.nntp.perl.org/group/perl.perl5.porters/2017/10/msg246838.html - I don't think there's any reasonable path to defaulting other handles to set utf8 layers.
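
Points 1 and 3 can be illustrated with a small sketch. It is written in Python purely to make the bytes visible, and Python's `re` module is only a loose stand-in for Perl's regex engine, so treat it as an analogy and not a description of perl internals. The variable names are illustrative.

```python
import re

# Point 3: the two-character ISO-8859-1 string U+00C2 U+00BB ("A-circumflex"
# followed by the right-pointing double angle quotation mark) ...
latin1_text = "\u00c2\u00bb"
raw = latin1_text.encode("latin-1")   # the bytes c2 bb

# ... is byte-for-byte identical to the UTF-8 encoding of U+00BB alone,
# so the bytes by themselves cannot say which reading was intended.
as_utf8 = raw.decode("utf-8")

# Point 1: without decoding, a word-character match sees only raw bytes.
word_bytes = "\u00e9".encode("utf-8")  # e-acute as the bytes c3 a9
bytes_match = re.fullmatch(rb"\w+", word_bytes)                 # \w is ASCII-only on bytes
text_match = re.fullmatch(r"\w+", word_bytes.decode("utf-8"))   # decoded text matches
```

Here `raw` is valid under both interpretations, which is why a byte-pattern heuristic can misfire on legitimate single-byte source, and `bytes_match` is `None` while `text_match` succeeds, which is why the encoding has to be known before character-aware operations make sense.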

p5pRT commented May 7, 2018

The RT System itself - Status changed from 'new' to 'open'
