UTF-8 scripts with BOM not auto-detected #15960

Open
p5pRT opened this issue Apr 23, 2017 · 9 comments

Comments

@p5pRT

p5pRT commented Apr 23, 2017

Migrated from rt.perl.org#131195 (status was 'open')

Searchable as RT131195$

@p5pRT
Author

p5pRT commented Apr 23, 2017

From @jimav

(I can't seem to send purely plain text email, so I'm sending the
perlbug file as an attachment to avoid undesired line wraps)

@p5pRT
Author

p5pRT commented Apr 23, 2017

From @jimav

This is a bug report for perl from jim.avera@​gmail.com,
generated with the help of perlbug 1.40 running under perl 5.22.1.


According to perlunicode(1):
  "... if a Perl script begins with the Unicode "BOM" (UTF-16LE, UTF16-BE,
  or UTF-8), or if the script looks like non-"BOM"-marked UTF-16 of either
  endianness, Perl will correctly read in the script as the appropriate
  Unicode encoding."

That is true for UTF-16 variants, but not UTF-8.

#!/usr/bin/perl
#
# Test to see if perl can auto-detect script encodings from a BOM
# (it's best to view the output on a utf-8 terminal)
#
use strict; use warnings;

# Do everything in a temporary directory
my $tdir = "/tmp/test.dir"; system "set -x; rm -rf $tdir; mkdir $tdir";
chdir $tdir || die;

# Some Perl source code which uses Unicode in identifiers and strings
my $sourcecode = <<EOF;
  use strict; use warnings;
  my \$\N{U+0444}\N{U+043E}\N{U+043E} = 42; # \$фоо = 42;
  my \$\N{U+041E}\N{U+0442}\N{U+0440}\N{U+043E} = "\N{U+2169}\N{U+216C}\N{U+2161}"; # \$Отро = "XLII";

  use open ':std', ':encoding(utf8)';

  print "ABC"
  ."\N{LEFT-POINTING DOUBLE ANGLE QUOTATION MARK}"
  ."DEF"
  ."\N{RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK}"
  ."GHI\\n";

  print "The answer is \$\N{U+0444}\N{U+043E}\N{U+043E} (\$\N{U+041E}\N{U+0442}\N{U+0440}\N{U+043E})\\n";

  exit \$\N{U+0444}\N{U+043E}\N{U+043E};
EOF

# Write out the perl script in various encodings, preceded by
# the BOM character.
#
# According to perlunicode(1):
# "... if a Perl script begins with the Unicode "BOM" (UTF-16LE, UTF16-BE,
# or UTF-8), or if the script looks like non-"BOM"-marked UTF-16 of either
# endianness, Perl will correctly read in the script as the appropriate
# Unicode encoding."
#
for ('UTF-8', 'UTF-16LE', 'UTF-16BE',
  'UTF-16LE-nobom', 'UTF-16BE-nobom',
  'UTF-32LE', 'UTF-32BE')
{
  print "=================================================\n";
  my $enc = $_;
  my $nobom = $enc =~ s/-nobom$//;
  my $path = "test_${_}.pl";
  open my $fh, ">:encoding($enc)", $path or die "cannot open $path: $!";
  print $fh "\N{U+FEFF}" unless $nobom; # the BOM character
  print $fh $sourcecode;
  close $fh or die "write error ($!)";
  system "set -x; od -N 16 -t x1 $path";
  system "set -x; perl $path";
}
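
The `od` output from the script above can be cross-checked with a small byte-level classifier. The following is a minimal sketch, not anything perl does internally: `detect_bom` and `detect_bom_in_file` are hypothetical helper names introduced here for illustration, and the only facts assumed are the standard BOM byte sequences (EF BB BF for UTF-8, FF FE / FE FF for UTF-16, FF FE 00 00 / 00 00 FE FF for UTF-32).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of perl itself): guess an encoding from the
# leading bytes of a file. UTF-32 is tested before UTF-16 because the
# UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
# Caveat: UTF-16LE text whose first character after the BOM is U+0000
# would be reported as UTF-32LE by this simple check.
sub detect_bom {
    my ($bytes) = @_;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'none';
}

# Read the first four bytes of a file in raw mode and classify them.
sub detect_bom_in_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "cannot open $path: $!";
    my $n = read $fh, my $head, 4;
    defined $n or die "cannot read $path: $!";
    close $fh;
    return detect_bom($head);
}

# Usage: classify every file named on the command line.
print "$_: ", detect_bom_in_file($_), "\n" for @ARGV;
```

Run against the generated `test_*.pl` files, this should report a BOM for every variant except the two `-nobom` ones, which makes it easy to confirm that the UTF-8 test file really does start with EF BB BF.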



Flags​:
  category=core
  severity=low


Site configuration information for perl 5.22.1​:

Configured by Debian Project at Sun Mar 13 11​:54​:18 UTC 2016.

Summary of my perl5 (revision 5 version 22 subversion 1) configuration​:
 
  Platform​:
  osname=linux, osvers=3.16.0, archname=x86_64-linux-gnu-thread-multi
  uname='linux localhost 3.16.0 #1 smp debian 3.16.0 x86_64 gnulinux '
  config_args='-Dusethreads -Duselargefiles -Dcc=x86_64-linux-gnu-gcc -Dcpp=x86_64-linux-gnu-cpp -Dld=x86_64-linux-gnu-gcc -Dccflags=-DDEBIAN -Wdate-time -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Dldflags= -Wl,-Bsymbolic-functions -Wl,-z,relro -Dlddlflags=-shared -Wl,-Bsymbolic-functions -Wl,-z,relro -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.22 -Darchlib=/usr/lib/x86_64-linux-gnu/perl/5.22 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/x86_64-linux-gnu/perl5/5.22 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.22.1 -Dsitearch=/usr/local/lib/x86_64-linux-gnu/perl/5.22.1 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Duse64bitint -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -Uversiononly -DDEBUGGING=-g -Doptimize=-O2 -dEs -Duseshrplib -Dlibperl=libperl.so.5.22.1'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=define, usemultiplicity=define
  use64bitint=define, use64bitall=define, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='x86_64-linux-gnu-gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O2 -g',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include'
  ccversion='', gccversion='5.3.1 20160311', gccosandvers=''
  intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678, doublekind=3
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16, longdblkind=3
  ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries​:
  ld='x86_64-linux-gnu-gcc', ldflags =' -fstack-protector-strong -L/usr/local/lib'
  libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/5/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
  libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
  perllibs=-ldl -lm -lpthread -lc -lcrypt
  libc=libc-2.21.so, so=so, useshrplib=true, libperl=libperl.so.5.22
  gnulibc_version='2.21'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib -fstack-protector-strong'

Locally applied patches​:
  DEBPKG​:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN.
  DEBPKG​:debian/db_file_ver - http​://bugs.debian.org/340047 Remove overly restrictive DB_File version check.
  DEBPKG​:debian/doc_info - Replace generic man(1) instructions with Debian-specific information.
  DEBPKG​:debian/enc2xs_inc - http​://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @​INC directories.
  DEBPKG​:debian/errno_ver - http​://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes.
  DEBPKG​:debian/libperl_embed_doc - http​://bugs.debian.org/186778 Note that libperl-dev package is required for embedded linking
  DEBPKG​:fixes/respect_umask - Respect umask during installation
  DEBPKG​:debian/writable_site_dirs - Set umask approproately for site install directories
  DEBPKG​:debian/extutils_set_libperl_path - EU​:MM​: set location of libperl.a under /usr/lib
  DEBPKG​:debian/no_packlist_perllocal - Don't install .packlist or perllocal.pod for perl or vendor
  DEBPKG​:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets.
  DEBPKG​:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor.
  DEBPKG​:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy.
  DEBPKG​:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable.
  DEBPKG​:debian/mod_paths - Tweak @​INC ordering for Debian
  DEBPKG​:debian/prune_libs - http​://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need.
  DEBPKG​:fixes/net_smtp_docs - [rt.cpan.org #36038] http​://bugs.debian.org/100195 Document the Net​::SMTP 'Port' option
  DEBPKG​:debian/perlivp - http​://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local
  DEBPKG​:debian/deprecate-with-apt - http​://bugs.debian.org/747628 Point users to Debian packages of deprecated core modules
  DEBPKG​:debian/squelch-locale-warnings - http​://bugs.debian.org/508764 Squelch locale warnings in Debian package maintainer scripts
  DEBPKG​:debian/skip-upstream-git-tests - Skip tests specific to the upstream Git repository
  DEBPKG​:debian/patchlevel - http​://bugs.debian.org/567489 List packaged patches for 5.22.1-9 in patchlevel.h
  DEBPKG​:debian/skip-kfreebsd-crash - http​://bugs.debian.org/628493 [perl #96272] Skip a crashing test case in t/op/threads.t on GNU/kFreeBSD
  DEBPKG​:fixes/document_makemaker_ccflags - http​://bugs.debian.org/628522 [rt.cpan.org #68613] Document that CCFLAGS should include $Config{ccflags}
  DEBPKG​:debian/find_html2text - http​://bugs.debian.org/640479 Configure CPAN​::Distribution with correct name of html2text
  DEBPKG​:debian/perl5db-x-terminal-emulator.patch - http​://bugs.debian.org/668490 Invoke x-terminal-emulator rather than xterm in perl5db.pl
  DEBPKG​:debian/cpan-missing-site-dirs - http​://bugs.debian.org/688842 Fix CPAN​::FirstTime defaults with nonexisting site dirs if a parent is writable
  DEBPKG​:fixes/memoize_storable_nstore - [rt.cpan.org #77790] http​://bugs.debian.org/587650 Memoize​::Storable​: respect 'nstore' option not respected
  DEBPKG​:debian/regen-skip - Skip a regeneration check in unrelated git repositories
  DEBPKG​:debian/makemaker-pasthru - http​://bugs.debian.org/758471 Pass LD settings through to subdirectories
  DEBPKG​:fixes/pod_man_reproducible_date - http​://bugs.debian.org/759405 Support POD_MAN_DATE in Pod​::Man for the left-hand footer
  DEBPKG​:debian/locale-robustness - http​://bugs.debian.org/782068 [perl #124310] Make t/run/locale.t survive missing locales masked by LC_ALL
  DEBPKG​:fixes/podman-utc - http​://bugs.debian.org/780259 Make the embedded date from Pod​::Man reproducible
  DEBPKG​:fixes/podman-utc-docs - http​://bugs.debian.org/780259 Documentation and test suite updates for UTC fix
  DEBPKG​:fixes/podman-empty-date - http​://bugs.debian.org/780259 Support an empty POD_MAN_DATE environment variable
  DEBPKG​:fixes/podman-pipe - http​://bugs.debian.org/777405 Better errors for man pages from standard input
  DEBPKG​:debian/pod2man-customized - Update porting/customized.dat for pod2man modifications
  DEBPKG​:debian/makemaker-manext - http​://bugs.debian.org/247370 Make EU​::MakeMaker honour MANnEXT settings in generated manpage headers
  DEBPKG​:debian/makemaker_customized - Update t/porting/customized.dat for files patched in Debian
  DEBPKG​:debian/do-not-record-build-date - [6baa8db] http​://bugs.debian.org/774422 [perl #125830] Allow overriding the compile time in "perl -V" output
  DEBPKG​:fixes/podman-source-date-epoch - http​://bugs.debian.org/801621 Make Pod​::Man honor the SOURCE_DATE_EPOCH environment variable
  DEBPKG​:fixes/podman-source-date-epoch-cleanups - http​://bugs.debian.org/801621 Coding style and documentation for SOURCE_EPOCH_DATE
  DEBPKG​:fixes/podman-source-date-epoch-testfix - http​://bugs.debian.org/807086 Guard for building with SOURCE_DATE_EPOCH or POD_MAN_DATE set
  DEBPKG​:debian/devel-ppport-reproducibility - http​://bugs.debian.org/801523 Sort the list of XS code files when generating RealPPPort.xs
  DEBPKG​:fixes/encode-unicode-bom - http​://bugs.debian.org/798727 [rt.cpan.org #107043] Address https://rt.cpan.org/Public/Bug/Display.html?id=107043
  DEBPKG​:debian/encode-unicode-bom-doc - http​://bugs.debian.org/798727 Document Debian backport of Encode​::Unicode fix
  DEBPKG​:debian/kfreebsd-softupdates - http​://bugs.debian.org/796798 Work around Debian Bug#796798
  DEBPKG​:fixes/autodie-scope - http​://bugs.debian.org/798096 Fix a scoping issue with "no autodie" and the "system" sub
  DEBPKG​:debian/debugperl-compat-fix - [perl #127212] http​://bugs.debian.org/810326 Disable PERL_TRACK_MEMPOOL for debugging builds
  DEBPKG​:fixes/CVE-2015-8607_file_spec_taint_fix - http​://bugs.debian.org/810719 [perl #126862] ensure File​::Spec​::canonpath() preserves taint
  DEBPKG​:fixes/mkstemp-umask - http​://bugs.debian.org/810924 [perl #127322] [e57270b] Fix umask for mkstemp(3) calls
  DEBPKG​:fixes/crosscompile-no-targethost - [perl #127234] Fix the Configure escape with usecrosscompile but no targethost
  DEBPKG​:fixes/podlators-no-encode - [rt.cpan.org #111156] Degrade gracefully if utf8 is requested but Encode is not available
  DEBPKG​:debian/cross-time-hires - [rt.cpan.org #111391] Add an environment variable to skip running configuration probes
  DEBPKG​:fixes/encode-unicode-pod - Unicode.pm​: Fix POD error
  DEBPKG​:fixes/memoize-pod - [rt.cpan.org #89441] Fix POD errors in Memoize
  DEBPKG​:fixes/ok-pod - Added encoding for pod.
  DEBPKG​:fixes/CVE-2016-2381_duplicate_env - remove duplicate environment variables from environ


@​INC for perl 5.22.1​:
  /home/jima/perl5/lib/perl5/5.22.1/x86_64-linux-gnu-thread-multi
  /home/jima/perl5/lib/perl5/5.22.1
  /home/jima/perl5/lib/perl5/x86_64-linux-gnu-thread-multi
  /home/jima/perl5/lib/perl5
  /home/jima/lib/perl
  /etc/perl
  /usr/local/lib/x86_64-linux-gnu/perl/5.22.1
  /usr/local/share/perl/5.22.1
  /usr/lib/x86_64-linux-gnu/perl5/5.22
  /usr/share/perl5
  /usr/lib/x86_64-linux-gnu/perl/5.22
  /usr/share/perl/5.22
  /usr/local/lib/site_perl
  /usr/lib/x86_64-linux-gnu/perl-base
  .


Environment for perl 5.22.1​:
  HOME=/home/jima
  LANG=en_US.UTF-8
  LANGUAGE=en_US
  LC_COLLATE=C
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/home/jima/perl5/bin​:/home/jima/bin​:/home/jima/jima_tools/x86_64/bin​:/home/jima/jima_tools/bin​:/usr/bin​:/bin​:/usr/sbin​:/sbin​:/usr/bin/X11​:/usr/local/bin​:/usr/local/sbin​:/usr/games​:/usr/local/games​:/usr/lib/jvm/java-8-oracle/bin​:/usr/lib/jvm/java-8-oracle/db/bin​:/usr/lib/jvm/java-8-oracle/jre/bin​:.
  PERL5LIB=/home/jima/perl5/lib/perl5​:/home/jima/lib/perl
  PERL_BADLANG (unset)
  PERL_LOCAL_LIB_ROOT=/home/jima/perl5
  PERL_MB_OPT=--install_base "/home/jima/perl5"
  PERL_MM_OPT=INSTALL_BASE=/home/jima/perl5
  SHELL=/bin/bash

@p5pRT
Author

p5pRT commented Apr 23, 2017

From @mauke

On 23.04.2017 at 03:11, (via RT) wrote:

> # New Ticket Created by
> # Please include the string: [perl #131195]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=131195 >
>
> According to perlunicode(1):
> "... if a Perl script begins with the Unicode "BOM" (UTF-16LE, UTF16-BE,
> or UTF-8), or if the script looks like non-"BOM"-marked UTF-16 of either
> endianness, Perl will correctly read in the script as the appropriate
> Unicode encoding.
>
> That is true for UTF-16 variants, but not UTF-8.

Duplicate of https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121292 ?

--
Lukas Mai <plokinom@​gmail.com>

From: demerphq <demerphq@gmail.com>
Date: Sun, 23 Apr 2017 11:27:04 +0200
Subject: Re: [perl #131195] UTF-8 scripts with BOM not auto-detected

On 23 April 2017 at 11:13, Lukas Mai <plokinom@gmail.com> wrote:

> On 23.04.2017 at 03:11, (via RT) wrote:
>
>> # New Ticket Created by
>> # Please include the string: [perl #131195]
>> # in the subject line of all future correspondence about this issue.
>> # <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=131195 >
>>
>> According to perlunicode(1):
>> "... if a Perl script begins with the Unicode "BOM" (UTF-16LE,
>> UTF16-BE, or UTF-8), or if the script looks like non-"BOM"-marked
>> UTF-16 of either endianness, Perl will correctly read in the script
>> as the appropriate Unicode encoding.
>>
>> That is true for UTF-16 variants, but not UTF-8.
>
> Duplicate of https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121292 ?

If it is, I will just say that I think this issue could be reopened. I
think that ticket was decided wrongly.

I think we should have respected the docs and added support for
UTF-8 BOMs. Strictly speaking they are not required, but they are common
in Windows workflows, and I don't see what harm is caused by respecting
them as compared to respecting UTF-16 BOMs. As far as I can tell, the
only difference is that with UTF-16 a BOM is required to properly
discriminate UTF-16BE from UTF-16LE data, whereas UTF-8, strictly
speaking, is endianness-neutral. However, on Windows it is traditional
to use BOMs to signal any format of Unicode, so we force people using
UTF-8 on Windows to scrub their BOMs. I never understood why,
especially since most people who object to this are on *nix platforms
where such BOMs almost never show up. (I remember getting bitten by
UTF-8 BOMs when I worked on Windows a lot, but have never seen a
UTF-8 BOM since I switched to *nix.)

Maybe we should re-enable this on Windows, or make it a build option.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
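
For readers wondering what "scrubbing" a UTF-8 BOM looks like in practice, here is a minimal sketch. It assumes the file fits in memory and that the BOM, if present, is the three bytes EF BB BF at the very start; `strip_utf8_bom` and `scrub_file` are hypothetical names introduced for this illustration, not existing tools.

```perl
use strict;
use warnings;

# Hypothetical helper: return the given byte string without a leading
# UTF-8 BOM (EF BB BF). Input without a BOM is returned unchanged.
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}

# Rewrite a file in place without its BOM.
# Returns 1 if a BOM was removed, 0 if the file was left untouched.
sub scrub_file {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "cannot open $path: $!";
    my $content = do { local $/; <$in> };   # slurp the whole file as bytes
    close $in;
    $content = '' unless defined $content;

    my $clean = strip_utf8_bom($content);
    return 0 if length($clean) == length($content);

    open my $out, '>:raw', $path or die "cannot write $path: $!";
    print {$out} $clean;
    close $out or die "write error on $path: $!";
    return 1;
}

# Usage: scrub every file named on the command line.
for my $path (@ARGV) {
    print "$path: ", (scrub_file($path) ? "BOM removed" : "unchanged"), "\n";
}
```

Working on raw bytes rather than decoded text keeps the rest of the file byte-for-byte identical, which matters when the script's own encoding is in question.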

From: Lukas Mai <plokinom@gmail.com>
Date: Sun, 23 Apr 2017 12:14:27 +0200
Subject: Re: [perl #131195] UTF-8 scripts with BOM not auto-detected

On 23.04.2017 at 11:27, demerphq wrote:

> On 23 April 2017 at 11:13, Lukas Mai <plokinom@gmail.com> wrote:
>
>> Duplicate of https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121292 ?
>
> If it is, I will just say that I think this issue could be reopened. I
> think that ticket was decided wrongly.
>
> I think we should have respected the docs and added support for
> UTF-8 BOMs. Strictly speaking they are not required, but they are common
> in Windows workflows, and I don't see what harm is caused by respecting
> them as compared to respecting UTF-16 BOMs. As far as I can tell, the
> only difference is that with UTF-16 a BOM is required to properly
> discriminate UTF-16BE from UTF-16LE data, whereas UTF-8, strictly
> speaking, is endianness-neutral. However, on Windows it is traditional
> to use BOMs to signal any format of Unicode, so we force people using
> UTF-8 on Windows to scrub their BOMs. I never understood why,
> especially since most people who object to this are on *nix platforms
> where such BOMs almost never show up. (I remember getting bitten by
> UTF-8 BOMs when I worked on Windows a lot, but have never seen a
> UTF-8 BOM since I switched to *nix.)
>
> Maybe we should re-enable this on Windows, or make it a build option.

The problem I'm worried about is that we already see problems from users
who write scripts on Windows (or copy them from somewhere in Windows
format), then run them on Unix, only to get​:

$ ./my_script.pl
./my_script.pl​: No such file or directory

when my_script.pl clearly exists. This failure mode is caused by the
shebang line containing an invisible carriage return​:

#!/usr/bin/perl\r

Similarly, an invisible BOM at the beginning would completely break the
"#!" mechanism. That's why I think we shouldn't encourage it.

PS​: I like "Perl5 Porteros" :-)

--
Lukas Mai <plokinom@​gmail.com>
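
The two failure modes mentioned in this reply (a carriage return at the end of the "#!" line, and a BOM in front of it) can both be spotted by looking at the raw bytes of a script's first line. The sketch below is illustrative only: `shebang_problems` is a hypothetical name, and it assumes the usual Unix behaviour that a file is treated as an interpreter script only when its first two bytes are "#!".

```perl
use strict;
use warnings;

# Hypothetical checker: given the first line of a script as raw bytes,
# return a list of problems likely to break the "#!" mechanism.
sub shebang_problems {
    my ($first_line) = @_;
    my @problems;

    # A UTF-8 BOM (EF BB BF) ahead of "#!" means the file no longer
    # starts with the two bytes "#!".
    push @problems, 'UTF-8 BOM before #!'
        if $first_line =~ /\A\xEF\xBB\xBF#!/;

    # A trailing carriage return becomes part of the interpreter path,
    # e.g. "/usr/bin/perl\r", which does not exist.
    push @problems, 'carriage return in #! line'
        if $first_line =~ /\A(?:\xEF\xBB\xBF)?#![^\n]*\r\n?\z/;

    return @problems;
}

# Usage: check the first line of every file named on the command line.
for my $path (@ARGV) {
    open my $fh, '<:raw', $path or die "cannot open $path: $!";
    my $line = <$fh>;
    close $fh;
    next unless defined $line;
    print "$path: $_\n" for shebang_problems($line);
}
```

A check like this could run as a pre-commit step for scripts that are edited on Windows and deployed on Unix.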

From: demerphq <demerphq@gmail.com>
Date: Sun, 23 Apr 2017 13:10:48 +0200
Subject: Re: [perl #131195] UTF-8 scripts with BOM not auto-detected

On 23 April 2017 at 12​:14, Lukas Mai <plokinom@​gmail.com> wrote​:

Am 23.04.2017 um 11​:27 schrieb demerphq​:

On 23 April 2017 at 11​:13, Lukas Mai <plokinom@​gmail.com> wrote​:

Duplicate of https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121292 ?

If it is I will just say that i think this issue could be reopened. I
think that ticket was decided wrongly.

I think we should have respected the docs and added support for
utf8-bom's. Strictly speaking they are unrequired, but they are common
in Windows workflow, and I don't see what harm is caused by respecting
them as compared to respecting UTF-16 BOM's. As far as I can tell the
only difference is that with UTF16 BOM's are required to properly
discriminate UTF-16BE and UTF-16LE data, whereas utf8 strictly
speaking is endianness neutral. However, in windows it is traditional
to use BOM's to signal any format of unicode, so we force people using
utf8 on windows to scrub their BOM's. I never understood why,
especially since most people who object to this are on *nix platforms
where such BOM's almost never show up. (I remember getting bitten by
utf8 BOM's when I worked on Windows a lot, but have never seen a
utf8-BOM since I switched to *nix.)

Maybe we should re-enable this on Windows, or make it be a build option.

The problem I'm worried about is that we already see problems from users who
write scripts on Windows (or copy them from somewhere in Windows format),
then run them on Unix, only to get​:

$ ./my_script.pl
./my_script.pl​: No such file or directory

when my_script.pl clearly exists. This failure mode is caused by the shebang
line containing an invisible carriage return​:

#!/usr/bin/perl\r

Similarly, an invisible BOM at the beginning would completely break the "#!"
mechanism. That's why I think we shouldn't encourage it.

Interesting. My response to that is "so lets make that work as well,
and not inconvenience our users." I mean if we see the \r maybe we
should just assume the file is in windows line endings and DTRT.

My point here is that it seems to me that most of these failure modes
are of that irritating type where Perl knows what is wrong, and could
do something reasonable, but doesn't.

PS​: I like "Perl5 Porteros" :-)

I think that was the name someone had given it who I replied to first
on list. Gmail remembered it, and despite a few lazy attempts to fix
it gmail has stubbornly refused to use anything else. I gave up caring
after a while. :-)

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

From perl5-porters-return-244165-rt-listener=rtperl.dev@​perl.org Sun Apr 23 12​:33​:30 2017
From: Zefram <zefram@fysh.org>
To: perl5-porters@perl.org
Date: Sun, 23 Apr 2017 20:33:12 +0100
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion

Bisecting shows that the warning started appearing for that test script
at v5.21.7-165-g613abc6 "Raise warning on multi-byte char in single-byte
locale".

Attempting to minimise the test script, it turns out that the "use
experimental" line is not required for any reason relating to smartmatch,
but simply for its effect on lexical warning flags. Anything touching
lexical warnings will do, such as the simpler "use warnings". And thus
enabling all warnings produces an additional warning that sheds some
light on the matter​:

$ perl ../rt131190
Malformed UTF-8 character (empty string) in pattern match (m//) at ../rt131190 line 8.
Wide character (U+FFFD) in pattern match (m//) at ../rt131190 line 8.

Looks like the problem is that the check for wide characters should be
passing in the UTF8_ALLOW_EMPTY flag. Without this, when it's at end
of string it perceives a malformed character, for which it warns about
malformation and substitutes in a replacement character, which is wide
and therefore triggers the wide character warning.

-zefram

@p5pRT

p5pRT commented Apr 23, 2017

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

p5pRT commented Apr 23, 2017

From @demerphq

On 23 April 2017 at 11​:13, Lukas Mai <plokinom@​gmail.com> wrote​:

On 23.04.2017 at 03:11, (via RT) wrote:

# New Ticket Created by
# Please include the string​: [perl #131195]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=131195 >

According to perlunicode(1):
"... if a Perl script begins with the Unicode "BOM" (UTF-16LE,
UTF16-BE, or UTF-8), or if the script looks like non-"BOM"-marked
UTF-16 of either endianness, Perl will correctly read in the script
as the appropriate Unicode encoding."

That is true for UTF-16 variants, but not UTF-8.

Duplicate of https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121292 ?

If it is, I will just say that I think this issue could be reopened. I
think that ticket was decided wrongly.

I think we should have respected the docs and added support for
UTF-8 BOMs. Strictly speaking they are not required, but they are common
in Windows workflows, and I don't see what harm is caused by respecting
them as compared to respecting UTF-16 BOMs. As far as I can tell, the
only difference is that with UTF-16 a BOM is required to properly
discriminate UTF-16BE from UTF-16LE data, whereas UTF-8, strictly
speaking, is endianness-neutral. However, on Windows it is traditional
to use BOMs to signal any format of Unicode, so we force people using
UTF-8 on Windows to scrub their BOMs. I never understood why,
especially since most people who object to this are on *nix platforms
where such BOMs almost never show up. (I remember getting bitten by
UTF-8 BOMs when I worked on Windows a lot, but have never seen a
UTF-8 BOM since I switched to *nix.)

Maybe we should re-enable this on Windows, or make it a build option.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
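
The BOM scrubbing mentioned above can be done with a few lines of Perl. The following is only a minimal sketch, not anything proposed in this ticket: `strip_utf8_bom` is a hypothetical helper name, and the script assumes it is given file names on the command line and is allowed to rewrite them in place. Note that removing the BOM does not make perl decode the source as UTF-8; a script with non-ASCII identifiers or literals would still need `use utf8;`.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of raw script bytes with a leading
# UTF-8 BOM (EF BB BF) removed. UTF-16 BOMs are deliberately left alone.
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}

# Rewrite each file named on the command line, but only if it changed.
for my $path (@ARGV) {
    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    my $orig = do { local $/; <$in> };
    close $in;
    $orig = '' unless defined $orig;

    my $clean = strip_utf8_bom($orig);
    next if $clean eq $orig;    # no BOM, nothing to do

    open my $out, '>:raw', $path or die "Cannot write $path: $!";
    print {$out} $clean;
    close $out or die "Cannot close $path: $!";
    print "stripped UTF-8 BOM from $path\n";
}
```

Saved under a made-up name such as `scrub-bom.pl`, it would be run as `perl scrub-bom.pl script1.pl script2.pl`.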

@p5pRT

p5pRT commented Apr 23, 2017

From @mauke

On 23.04.2017 at 11:27, demerphq wrote:

On 23 April 2017 at 11​:13, Lukas Mai <plokinom@​gmail.com> wrote​:

Duplicate of https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121292 ?

If it is, I will just say that I think this issue could be reopened. I
think that ticket was decided wrongly.

I think we should have respected the docs and added support for
UTF-8 BOMs. Strictly speaking they are not required, but they are common
in Windows workflows, and I don't see what harm is caused by respecting
them as compared to respecting UTF-16 BOMs. As far as I can tell, the
only difference is that with UTF-16 a BOM is required to properly
discriminate UTF-16BE from UTF-16LE data, whereas UTF-8, strictly
speaking, is endianness-neutral. However, on Windows it is traditional
to use BOMs to signal any format of Unicode, so we force people using
UTF-8 on Windows to scrub their BOMs. I never understood why,
especially since most people who object to this are on *nix platforms
where such BOMs almost never show up. (I remember getting bitten by
UTF-8 BOMs when I worked on Windows a lot, but have never seen a
UTF-8 BOM since I switched to *nix.)

Maybe we should re-enable this on Windows, or make it a build option.

The problem I'm worried about is that we already see problems from users
who write scripts on Windows (or copy them from somewhere in Windows
format), then run them on Unix, only to get​:

$ ./my_script.pl
./my_script.pl​: No such file or directory

when my_script.pl clearly exists. This failure mode is caused by the
shebang line containing an invisible carriage return​:

#!/usr/bin/perl\r

Similarly, an invisible BOM at the beginning would completely break the
"#!" mechanism. That's why I think we shouldn't encourage it.

PS​: I like "Perl5 Porteros" :-)

--
Lukas Mai <plokinom@​gmail.com>
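
Both failure modes described above can be checked for before a script is copied to a Unix system. This is a rough sketch that assumes only a UTF-8 BOM and a trailing carriage return matter; `shebang_problems` is a hypothetical helper, not part of perl.

```perl
use strict;
use warnings;

# Hypothetical helper: given the raw bytes of a script's first line, list
# the reasons the kernel's "#!" handling would not find the intended
# interpreter.
sub shebang_problems {
    my ($first_line) = @_;
    my @problems;
    push @problems, 'starts with a UTF-8 BOM, so the file does not begin with "#!"'
        if $first_line =~ /\A\xEF\xBB\xBF/;
    push @problems, 'interpreter line ends in a carriage return'
        if $first_line =~ /\A(?:\xEF\xBB\xBF)?#!.*\r\n?\z/s;
    return @problems;
}

# Check the first line of every file named on the command line.
for my $path (@ARGV) {
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $line = <$fh> // '';
    close $fh;
    my @p = shebang_problems($line);
    print "$path: ", (@p ? join('; ', @p) : 'first line looks fine'), "\n";
}
```

The file is opened with the `:raw` layer so that neither the BOM nor the carriage return is hidden by any decoding or line-ending translation.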

@p5pRT

p5pRT commented Apr 23, 2017

From @demerphq

On 23 April 2017 at 12​:14, Lukas Mai <plokinom@​gmail.com> wrote​:

On 23.04.2017 at 11:27, demerphq wrote:

On 23 April 2017 at 11​:13, Lukas Mai <plokinom@​gmail.com> wrote​:

Duplicate of https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121292 ?

If it is, I will just say that I think this issue could be reopened. I
think that ticket was decided wrongly.

I think we should have respected the docs and added support for
UTF-8 BOMs. Strictly speaking they are not required, but they are common
in Windows workflows, and I don't see what harm is caused by respecting
them as compared to respecting UTF-16 BOMs. As far as I can tell, the
only difference is that with UTF-16 a BOM is required to properly
discriminate UTF-16BE from UTF-16LE data, whereas UTF-8, strictly
speaking, is endianness-neutral. However, on Windows it is traditional
to use BOMs to signal any format of Unicode, so we force people using
UTF-8 on Windows to scrub their BOMs. I never understood why,
especially since most people who object to this are on *nix platforms
where such BOMs almost never show up. (I remember getting bitten by
UTF-8 BOMs when I worked on Windows a lot, but have never seen a
UTF-8 BOM since I switched to *nix.)

Maybe we should re-enable this on Windows, or make it a build option.

The problem I'm worried about is that we already see problems from users who
write scripts on Windows (or copy them from somewhere in Windows format),
then run them on Unix, only to get​:

$ ./my_script.pl
./my_script.pl​: No such file or directory

when my_script.pl clearly exists. This failure mode is caused by the shebang
line containing an invisible carriage return​:

#!/usr/bin/perl\r

Similarly, an invisible BOM at the beginning would completely break the "#!"
mechanism. That's why I think we shouldn't encourage it.

Interesting. My response to that is "so let's make that work as well,
and not inconvenience our users." I mean, if we see the \r, maybe we
should just assume the file has Windows line endings and DTRT.

My point here is that it seems to me that most of these failure modes
are of that irritating type where Perl knows what is wrong, and could
do something reasonable, but doesn't.

PS​: I like "Perl5 Porteros" :-)

I think that was the name given to it by someone I first replied to
on the list. Gmail remembered it, and despite a few lazy attempts to fix
it, Gmail has stubbornly refused to use anything else. I gave up caring
after a while. :-)

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

p5pRT commented Apr 23, 2017

From @mauke

On 23.04.2017 at 13:10, demerphq wrote:

On 23 April 2017 at 12​:14, Lukas Mai <plokinom@​gmail.com> wrote​:

The problem I'm worried about is that we already see problems from users who
write scripts on Windows (or copy them from somewhere in Windows format),
then run them on Unix, only to get​:

$ ./my_script.pl
./my_script.pl​: No such file or directory

when my_script.pl clearly exists. This failure mode is caused by the shebang
line containing an invisible carriage return​:

#!/usr/bin/perl\r

Similarly, an invisible BOM at the beginning would completely break the "#!"
mechanism. That's why I think we shouldn't encourage it.

Interesting. My response to that is "so let's make that work as well,
and not inconvenience our users." I mean, if we see the \r, maybe we
should just assume the file has Windows line endings and DTRT.

My point here is that it seems to me that most of these failure modes
are of that irritating type where Perl knows what is wrong, and could
do something reasonable, but doesn't.

If you want to make that work, you have to go out and patch all unixish
kernels. Perl doesn't even run because there is no file called
"/usr/bin/perl\r" on the system.

(I suppose you could fix that by doing `ln -s perl $'/usr/bin/perl\r'`
as part of the install step, but ... eugh.)

But even that won't help you with a BOM​: Either it will fail outright
(unknown executable format (not ELF, doesn't start with "#!")) or the
shell will "helpfully" try to run it as a shell script. That's why
http://www.unicode.org/faq/utf_bom.html#bom10 says "Some byte oriented
protocols expect ASCII characters at the beginning of a file. If UTF-8
is used with these protocols, use of the BOM as encoding form signature
should be avoided."

--
Lukas Mai <plokinom@​gmail.com>

@p5pRT

p5pRT commented Apr 24, 2017

From @jimav

On Sun, 23 Apr 2017 14​:32​:33 -0700, plokinom@​gmail.com wrote​:

(I suppose you could fix that by doing `ln -s perl $'/usr/bin/perl\r'`
as part of the install step, but ... eugh.)
But even that won't help you with a BOM​: Either it will fail outright
(unknown executable format (not ELF, doesn't start with "#!")) or the
shell will "helpfully" try to run it as a shell script...

IMO, #! support is semi-off-topic. The problem at hand is that you can't say
  perl file.pl
and have it work if file.pl starts with a UTF-8 BOM. As noted by others,
you automatically get a BOM when saving a file in UTF-8 format on Windows.

I just don't see how any harm could come to *nix users if Perl recognizes the BOM *and* acts accordingly
(right now perl recognizes the BOM but simply ignores it, and decodes the rest of the file incorrectly).

Deliberately making life harder for users, even (gasp) users on Windows, should be done only with very compelling reasons!

I don't think a file starting with a BOM could legitimately contain non-Unicode characters. If there is a BOM, the file was created by Unicode-aware software (e.g. Notepad), and absent a bug or major user shenanigans, the file is certain to contain Unicode characters encoded as the BOM indicates.

BTW, a BOM is not invisible if you look at the file with vim -b.
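
For those without `vim -b` at hand, the leading bytes can also be inspected from Perl itself. This is a small sketch that assumes only the three BOMs named in the report are of interest (a UTF-32LE BOM, for instance, would be reported as UTF-16LE here); `bom_name` and `file_bom` are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical helper: name the BOM (if any) at the start of a byte string.
sub bom_name {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'none';
}

# Read only the first few bytes of a file, with no decoding layer applied.
sub file_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $n = read $fh, my $head, 4;
    defined $n or die "Cannot read $path: $!";
    close $fh;
    return bom_name($head);
}

# Report the BOM of every file named on the command line.
print "$_: ", file_bom($_), "\n" for @ARGV;
```

Running it over a directory of scripts, for example `perl check-bom.pl *.pl` (the script name is made up), would show which files an editor has silently prefixed with a BOM.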
