[PATCH] remove note about BOM from to 'use utf8' docs #13609

Open
p5pRT opened this issue Feb 18, 2014 · 27 comments

p5pRT commented Feb 18, 2014

Migrated from rt.perl.org#121269 (status was 'open')

Searchable as RT121269$

p5pRT commented Feb 18, 2014

From efimov@reg.ru

proposed patch attached.

Some related discussion:
http://www.nntp.perl.org/group/perl.unicode/1999/10/msg1.html

Currently a BOM does not seem to trigger 'use utf8' behaviour.
One possible reason not to merge this patch is if we're going to
use the BOM in the future.
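
A rough way to check the claim above on a given perl is sketched below. It is only a probe, not part of the patch: the helper name `probe` and the temporary-file approach are assumptions made for illustration. It writes a one-line script containing the two bytes C3 A9 inside a string literal, once without and once with a leading UTF-8 BOM, and prints what `length` reports in each case. A result of 2 means the literal was read as bytes; 1 would mean it was read as UTF-8 characters.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Body of the probe script: the bytes C3 A9 (UTF-8 for U+00E9) in a literal.
# The escapes are expanded here, so the probe file contains the raw bytes.
my $body = qq{print length("\xC3\xA9"), "\\n";\n};

# Hypothetical helper: write $prefix . $body to a temp file, run it with the
# same perl binary, and return whatever it printed.
sub probe {
    my ($prefix) = @_;
    my ($fh, $path) = tempfile(SUFFIX => '.pl', UNLINK => 1);
    binmode $fh;                          # raw bytes, no I/O layers
    print {$fh} $prefix, $body;
    close $fh or die "close $path: $!";
    my $out = `"$^X" "$path"`;
    chomp $out;
    return $out;
}

printf "no BOM:    length = %s\n", probe('');
printf "UTF-8 BOM: length = %s\n", probe("\xEF\xBB\xBF");
```

The second number is deliberately not stated here, since it is exactly what is in question and may depend on the perl version.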

===
Summary of my perl5 (revision 5 version 14 subversion 2) configuration​:

  Platform​:
  osname=linux, osvers=2.6.42-37-generic,
archname=x86_64-linux-gnu-thread-multi
  uname='linux panlong 2.6.42-37-generic #58-ubuntu smp thu jan 24
15​:28​:10 utc 2013 x86_64 x86_64 x86_64 gnulinux '
  config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN
-Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr
-Dprivlib=/usr/share/perl/5.14 -Darchlib=/usr/lib/perl/5.14
-Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5
-Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local
-Dsitelib=/usr/local/share/perl/5.14.2
-Dsitearch=/usr/local/lib/perl/5.14.2 -Dman1dir=/usr/share/man/man1
-Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1
-Dsiteman3dir=/usr/local/man/man3 -Duse64bitint -Dman1ext=1
-Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh
-Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -DDEBUGGING=-g -Doptimize=-O2
-Duseshrplib -Dlibperl=libperl.so.5.14.2 -des'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=define, usemultiplicity=define
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=define, use64bitall=define, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN
-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O2 -g',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing
-pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='4.6.3', gccosandvers=''
  intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
  ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /lib/x86_64-linux-gnu /lib/../lib
/usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib /usr/lib
  libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
  perllibs=-ldl -lm -lpthread -lc -lcrypt
  libc=, so=so, useshrplib=true, libperl=libperl.so.5.14.2
  gnulibc_version='2.15'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib
-fstack-protector'

Characteristics of this binary (from libperl)​:
  Compile-time options​: MULTIPLICITY PERL_DONT_CREATE_GVSV
  PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP
  PERL_PRESERVE_IVUV USE_64_BIT_ALL USE_64_BIT_INT
  USE_ITHREADS USE_LARGE_FILES USE_PERLIO USE_PERL_ATOF
  USE_REENTRANT_API
  Locally applied patches​:
  DEBPKG​:debian/arm_thread_stress_timeout -
http​://bugs.debian.org/501970 Raise the timeout of
ext/threads/shared/t/stress.t to accommodate slower build hosts
  DEBPKG​:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS
default for modules installed from CPAN.
  DEBPKG​:debian/db_file_ver - http​://bugs.debian.org/340047 Remove
overly restrictive DB_File version check.
  DEBPKG​:debian/doc_info - Replace generic man(1) instructions with
Debian-specific information.
  DEBPKG​:debian/enc2xs_inc - http​://bugs.debian.org/290336 Tweak
enc2xs to follow symlinks and ignore missing @​INC directories.
  DEBPKG​:debian/errno_ver - http​://bugs.debian.org/343351 Remove
Errno version check due to upgrade problems with long-running
processes.
  DEBPKG​:debian/libperl_embed_doc - http​://bugs.debian.org/186778
Note that libperl-dev package is required for embedded linking
  DEBPKG​:fixes/respect_umask - Respect umask during installation
  DEBPKG​:debian/writable_site_dirs - Set umask approproately for
site install directories
  DEBPKG​:debian/extutils_set_libperl_path - EU​:MM​: Set location of
libperl.a to /usr/lib
  DEBPKG​:debian/no_packlist_perllocal - Don't install .packlist or
perllocal.pod for perl or vendor
  DEBPKG​:debian/prefix_changes - Fiddle with *PREFIX and variables
written to the makefile
  DEBPKG​:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to
the binary targets.
  DEBPKG​:debian/instmodsh_doc - Debian policy doesn't install
.packlist files for core or vendor.
  DEBPKG​:debian/ld_run_path - Remove standard libs from LD_RUN_PATH
as per Debian policy.
  DEBPKG​:debian/libnet_config_path - Set location of libnet.cfg to
/etc/perl/Net as /usr may not be writable.
  DEBPKG​:debian/m68k_thread_stress - http​://bugs.debian.org/517938
http​://bugs.debian.org/495826 Disable some threads tests on m68k for
now due to missing TLS.
  DEBPKG​:debian/mod_paths - Tweak @​INC ordering for Debian
  DEBPKG​:debian/module_build_man_extensions -
http​://bugs.debian.org/479460 Adjust Module​::Build manual page
extensions for the Debian Perl policy
  DEBPKG​:debian/prune_libs - http​://bugs.debian.org/128355 Prune the
list of libraries wanted to what we actually need.
  DEBPKG​:fixes/net_smtp_docs - [rt.cpan.org #36038]
http​://bugs.debian.org/100195 Document the Net​::SMTP 'Port' option
  DEBPKG​:debian/perlivp - http​://bugs.debian.org/510895 Make perlivp
skip include directories in /usr/local
  DEBPKG​:debian/disable-zlib-bundling - Disable zlib bundling in
Compress​::Raw​::Zlib
  DEBPKG​:debian/cpanplus_definstalldirs -
http​://bugs.debian.org/533707 Configure CPANPLUS to use the site
directories by default.
  DEBPKG​:debian/cpanplus_config_path - Save local versions of
CPANPLUS​::Config​::System into /etc/perl.
  DEBPKG​:debian/deprecate-with-apt - http​://bugs.debian.org/580034
Point users to Debian packages of deprecated core modules
  DEBPKG​:fixes/hurd-ccflags - [a190e64]
http​://bugs.debian.org/587901 [perl #92244] Make hints/gnu.sh append
to $ccflags rather than overriding them
  DEBPKG​:debian/squelch-locale-warnings -
http​://bugs.debian.org/508764 Squelch locale warnings in Debian
package maintainer scripts
  DEBPKG​:debian/skip-upstream-git-tests - Skip tests specific to the
upstream Git repository
  DEBPKG​:fixes/extutils-cbuilder-cflags - [011e8fb]
http​://bugs.debian.org/624460 [perl #89478] Append CFLAGS and LDFLAGS
to their Config.pm counterparts in EU​::CBuilder
  DEBPKG​:fixes/module-build-home-directory -
http​://bugs.debian.org/624850 [rt.cpan.org #67893] Fix failing tilde
test when run under a UID without a passwd entry
  DEBPKG​:debian/patchlevel - http​://bugs.debian.org/567489 List
packaged patches for 5.14.2-6ubuntu2.4 in patchlevel.h
  DEBPKG​:fixes/h2ph-multiarch - [e7ec705]
http​://bugs.debian.org/625808 [perl #90122] Make h2ph correctly search
gcc include directories
  DEBPKG​:fixes/index-tainting - [3b36395]
http​://bugs.debian.org/291450 [perl #64804] RT 64804​: tainting with
index() of a constant
  DEBPKG​:debian/skip-kfreebsd-crash - http​://bugs.debian.org/628493
[perl #96272] Skip a crashing test case in t/op/threads.t on
GNU/kFreeBSD
  DEBPKG​:fixes/document_makemaker_ccflags -
http​://bugs.debian.org/628522 [rt.cpan.org #68613] Document that
CCFLAGS should include $Config{ccflags}
  DEBPKG​:fixes/sys-syslog-socket-timeout-kfreebsd.patch -
http​://bugs.debian.org/627821 [rt.cpan.org #69997] Use a socket
timeout on GNU/kFreeBSD to catch ICMP port unreachable messages
  DEBPKG​:fixes/hurd-hints - http​://bugs.debian.org/636609 Improve
general GNU hints, needed for GNU/Hurd.
  DEBPKG​:fixes/pod_fixes - [7698aed] http​://bugs.debian.org/637816
Fix typos in several pod/perl*.pod files
  DEBPKG​:debian/find_html2text - http​://bugs.debian.org/640479
Configure CPAN​::Distribution with correct name of html2text
  DEBPKG​:fixes/digest_eval_hole - http​://bugs.debian.org/644108
Close the eval "require $module" security hole in
Digest->new($algorithm)
  DEBPKG​:fixes/hurd-ndbm - [f0d0a20] [perl #102680]
http​://bugs.debian.org/645989 Add GNU/Hurd hints for NDBM_File
  DEBPKG​:fixes/sysconf.t-posix - [8040185] [perl #102888]
http​://bugs.debian.org/646016 Fix hang in ext/POSIX/t/sysconf.t on
GNU/Hurd
  DEBPKG​:fixes/hurd-largefile - [1fda587] [perl #103014]
http​://bugs.debian.org/645790 enable LFS on GNU/Hurd
  DEBPKG​:debian/hurd_test_todo_syslog -
http​://bugs.debian.org/650093 Disable failing GNU/Hurd tests in
cpan/Sys-Syslog/t/syslog.t
  DEBPKG​:fixes/hurd_skip_itimer_virtual - [rt.cpan.org #72754]
http​://bugs.debian.org/650094 Skip interval timer tests in Time​::HiRes
on GNU/Hurd
  DEBPKG​:debian/hurd_test_skip_socketpair -
http​://bugs.debian.org/650186 Disable failing GNU/Hurd tests
ext/Socket/t/socketpair.t
  DEBPKG​:debian/hurd_test_skip_sigdispatch -
http​://bugs.debian.org/650188 Disable failing GNU/Hurd tests
op/sigdispatch.t
  DEBPKG​:debian/hurd_test_skip_stack - http​://bugs.debian.org/650175
Disable failing GNU/Hurd tests dist/threads/t/stack.t
  DEBPKG​:debian/hurd_test_skip_recv - http​://bugs.debian.org/650095
Disable failing GNU/Hurd tests cpan/autodie/t/recv.t
  DEBPKG​:debian/hurd_test_skip_libc - http​://bugs.debian.org/650097
Disable failing GNU/Hurd tests dist/threads/t/libc.t
  DEBPKG​:debian/hurd_test_skip_pipe - http​://bugs.debian.org/650187
Disable failing GNU/Hurd tests io/pipe.t
  DEBPKG​:debian/hurd_test_skip_io_pipe -
http​://bugs.debian.org/650096 Disable failing GNU/Hurd tests
dist/IO/t/io_pipe.t
  DEBPKG​:fixes/CVE-2012-5195 - avoid calling memset with a negative count
  DEBPKG​:fixes/CVE-2012-5526 - [PATCH 1/4] CR escaping for P3P header
  DEBPKG​:CVE-2013-1667.patch - [PATCH] Prevent premature hsplit()
calls, and only trigger REHASH after hsplit()
  DEBPKG​:CVE-2012-6329.patch -
http​://bugs.debian.org/cgi-bin/bugreport.cgi?bug=695224 [1735f6f] fix
arbitrary command execution via _compile function in Maketext.pm
  Built under linux
  Compiled at Feb 4 2014 23​:11​:19
  %ENV​:
  PERLBREW_BASHRC_VERSION="0.67"
  PERLBREW_HOME="/home/vse/.perlbrew"
  PERLBREW_MANPATH=""
  PERLBREW_PATH="/home/perlbrew/bin"
  PERLBREW_ROOT="/home/perlbrew"
  PERLBREW_VERSION="0.67"
  @​INC​:
  /etc/perl
  /usr/local/lib/perl/5.14.2
  /usr/local/share/perl/5.14.2
  /usr/lib/perl5
  /usr/share/perl5
  /usr/lib/perl/5.14
  /usr/share/perl/5.14
  /usr/local/lib/site_perl
  .

p5pRT commented Feb 18, 2014

From efimov@reg.ru

0001-Removing-note-about-Byte-Order-Mark-from-utf8-docs.-.patch
From 54bcadfbceca8d8155745f4b0bf258737f5c6117 Mon Sep 17 00:00:00 2001
From: Victor Efimov <efimov@reg.ru>
Date: Tue, 18 Feb 2014 12:43:47 +0400
Subject: [PATCH] Removing note about Byte Order Mark from 'utf8' docs. UTF-8
 BOM does not seem to work as alternative to 'use utf8'.

---
 lib/utf8.pm |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/utf8.pm b/lib/utf8.pm
index 43c7277..67a57dc 100644
--- a/lib/utf8.pm
+++ b/lib/utf8.pm
@@ -57,8 +57,7 @@ script is written in UTF-8.> The utility functions described below are
 directly usable without C<use utf8;>.
 
 Because it is not possible to reliably tell UTF-8 from native 8 bit
-encodings, you need either a Byte Order Mark at the beginning of your
-source code, or C<use utf8;>, to instruct perl.
+encodings, you need C<use utf8;>, to instruct perl.
 
 When UTF-8 becomes the standard source format, this pragma will
 effectively become a no-op.  For convenience in what follows the term
-- 
1.7.9.5

p5pRT commented Feb 19, 2014

From @tonycoz

On Tue Feb 18 01​:04​:36 2014, efimov@​reg.ru wrote​:

proposed patch attached.

some related discussion​:
http​://www.nntp.perl.org/group/perl.unicode/1999/10/msg1.html

currently BOM does not seem to trigger 'use utf8' behaviour.
one possible reason to not to merge this patch is if we're going to
use BOM in the future.

A UTF-16 BOM is recognized though, and treats the source as unicode.

Tony

p5pRT commented Feb 19, 2014

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Feb 19, 2014

From @demerphq

On 19 February 2014 01​:46, Tony Cook via RT <perlbug-followup@​perl.org> wrote​:

On Tue Feb 18 01​:04​:36 2014, efimov@​reg.ru wrote​:

proposed patch attached.

some related discussion​:
http​://www.nntp.perl.org/group/perl.unicode/1999/10/msg1.html

currently BOM does not seem to trigger 'use utf8' behaviour.
one possible reason to not to merge this patch is if we're going to
use BOM in the future.

A UTF-16 BOM is recognized though, and treats the source as unicode.

Personally I think if we support UTF-16 BOM we should support them all.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented Feb 19, 2014

From @Tux

On Wed, 19 Feb 2014 11​:48​:14 +0100, demerphq <demerphq@​gmail.com> wrote​:

On 19 February 2014 01​:46, Tony Cook via RT <perlbug-followup@​perl.org> wrote​:

On Tue Feb 18 01​:04​:36 2014, efimov@​reg.ru wrote​:

currently BOM does not seem to trigger 'use utf8' behaviour.
one possible reason to not to merge this patch is if we're going to
use BOM in the future.

A UTF-16 BOM is recognized though, and treats the source as unicode.

Personally I think if we support UTF-16 BOM we should support them all.

So do I

--
H.Merijn Brand http​://tux.nl Perl Monger http​://amsterdam.pm.org/
using perl5.00307 .. 5.19 porting perl5 on HP-UX, AIX, and openSUSE
http​://mirrors.develooper.com/hpux/ http​://www.test-smoke.org/
http​://qa.perl.org http​://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented Feb 19, 2014

From @Hugmeir

On Wed, Feb 19, 2014 at 11​:48 AM, demerphq <demerphq@​gmail.com> wrote​:

On 19 February 2014 01​:46, Tony Cook via RT <perlbug-followup@​perl.org>
wrote​:

On Tue Feb 18 01​:04​:36 2014, efimov@​reg.ru wrote​:

proposed patch attached.

some related discussion​:
http​://www.nntp.perl.org/group/perl.unicode/1999/10/msg1.html

currently BOM does not seem to trigger 'use utf8' behaviour.
one possible reason to not to merge this patch is if we're going to
use BOM in the future.

A UTF-16 BOM is recognized though, and treats the source as unicode.

Personally I think if we support UTF-16 BOM we should support them all.

Agree, but easier said than done :) UTF-16 is supported by having our own
UTF-16 -> UTF-8 decoder in toke.c, which is insane, while "UTF-8" support
is really just skipping the bom, turning the flag on, and assuming that the
code is correctly encoded.

Some months back I tried replacing all of that mess with just an encoding
layer, but :crlf made things exceedingly complicated[0] -- once my computer
arrives over here, I'll rebase the branch and give it another shot, but
don't hold your breath.

[0] Scripts would actually need that turned *off* by default; after the
first line is read in binary mode, you can check for the bom, return that
line to the buffer, then apply the necessary layers, including crlf.
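
The approach in footnote [0] can be sketched in plain Perl, outside of toke.c. This is only an illustration under assumptions: `sniff_bom` and `open_with_bom` are hypothetical helper names, the fallback encoding is arbitrary, and the `:crlf` layer mentioned in the footnote is left out to keep the sketch short.

```perl
use strict;
use warnings;

# Known BOMs, longest first so UTF-32LE is not mistaken for UTF-16LE.
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: return (encoding name, BOM length) for a byte string,
# or an empty list when it does not start with a BOM.
sub sniff_bom {
    my ($head) = @_;
    for my $entry (@BOMS) {
        my ($bom, $name) = @$entry;
        return ($name, length $bom) if index($head, $bom) == 0;
    }
    return;
}

# Hypothetical helper: read the first bytes in binary mode, rewind to just
# past the BOM (if any), and only then push the decoding layer.
sub open_with_bom {
    my ($path, $default) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    defined read($fh, my $head, 4) or die "read $path: $!";
    my ($enc, $skip) = sniff_bom($head);
    seek $fh, $skip // 0, 0 or die "seek $path: $!";
    binmode $fh, ':encoding(' . ($enc // $default // 'UTF-8') . ')'
        or die "binmode $path: $!";
    return $fh;
}
```

Usage would be along the lines of `my $fh = open_with_bom('script.pl');` followed by ordinary line reads, which then return decoded text.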

p5pRT commented Feb 19, 2014

From efimov@reg.ru

2014-02-19 4​:46 GMT+04​:00 Tony Cook via RT <perlbug-followup@​perl.org>​:

A UTF-16 BOM is recognized though, and treats the source as unicode.

Yes, indeed, starting from around 5.12.

I wonder whether this should be documented in 'utf8' or not:
- UTF-16 is not UTF-8
- there is no 'use' equivalent for the UTF-16 BOM.
Maybe perlrun is a better place?

Tony

p5pRT commented Feb 19, 2014

From @ikegami

On Wed, Feb 19, 2014 at 5​:48 AM, demerphq <demerphq@​gmail.com> wrote​:

On 19 February 2014 01​:46, Tony Cook via RT <perlbug-followup@​perl.org>
wrote​:

On Tue Feb 18 01​:04​:36 2014, efimov@​reg.ru wrote​:

proposed patch attached.

some related discussion​:
http​://www.nntp.perl.org/group/perl.unicode/1999/10/msg1.html

currently BOM does not seem to trigger 'use utf8' behaviour.
one possible reason to not to merge this patch is if we're going to
use BOM in the future.

A UTF-16 BOM is recognized though, and treats the source as unicode.

Personally I think if we support UTF-16 BOM we should support them all.

That would break our code. We have UTF-8 files, but we don't use C<< use
utf8; >>.

p5pRT commented Feb 19, 2014

From victor@vsespb.ru

On Wed Feb 19 02​:48​:40 2014, demerphq wrote​:

Personally I think if we support UTF-16 BOM we should support them
all.

Yves

I am opposed to this.

1) 'use utf8' changes how the program behaves. It's not just cosmetic/metadata.

2) How will patches that add/remove a BOM look? How do diff/git/github/IDEs/other tools display a BOM? Will a patch to add 'use utf8' make sense if applied to a file with a BOM?

3) People will have to control the BOM - they'll invent Test::BOM, Test::NoBOM etc. to make sure their code is not broken because someone committed a BOM by accident.
Similar things happen now with tabs vs spaces - everyone knows when and why they should use tabs/spaces for a particular project, they know how to tune their IDE, but
they still commit tabs instead of spaces sometimes. The BOM case is much worse, because it leads to hidden bugs in the application.

4) How do other programming languages control this? I know Ruby uses a special pragma, "#encoding". They've chosen not to use the BOM. What about others?

5) TemplateToolkit uses the BOM for a similar purpose. My observation is that it's a PITA. Every time something breaks I have to run
grep -rl $'\xEF\xBB\xBF' . to make sure it was not a BOM.

6) There are still people out there who use UTF-8 files without 'use utf8' - they want UTF-8 constants without the flag.

7) Security issue? Someone compromises a server, adds a BOM to a file, introduces a security hole, and no one can find it.

The above does not apply to UTF-16, because UTF-8 and UTF-16 are different: UTF-8 is ASCII-compatible and UTF-16 is not. So:

a) UTF-16 cannot live without a BOM because it's not ASCII-compatible. It needs a BOM anyway.
b) UTF-16 needs a BOM because there are LE and BE variants.
c) UTF-16 with a wrong or missing BOM will most likely not compile at all.

(although I think that even with a UTF-16 BOM, the BOM should just be ignored and the text parsed as UTF-16 but without 'use utf8' behaviour).
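
For points 3) and 5), a check of the kind a Test::BOM-style module might do can be sketched as follows. The helper names `has_utf8_bom` and `files_with_bom` are hypothetical, not taken from any existing module. Unlike the grep above, which matches the three bytes anywhere in a file, this sketch only looks at the first three bytes.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: true if the file's first three bytes are EF BB BF.
sub has_utf8_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or return 0;   # unreadable files are skipped
    my $n = read $fh, my $head, 3;
    return defined $n && $n == 3 && $head eq "\xEF\xBB\xBF";
}

# Hypothetical helper: sorted list of regular files under $dir that start
# with a UTF-8 BOM.
sub files_with_bom {
    my ($dir) = @_;
    my @hits;
    find({
        no_chdir => 1,                          # $_ holds the full path
        wanted   => sub {
            push @hits, $File::Find::name if -f $_ && has_utf8_bom($_);
        },
    }, $dir);
    return sort @hits;
}
```

Something like `print "$_\n" for files_with_bom('.');` would then list the offending files, and a test suite could fail when that list is non-empty.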

p5pRT commented Feb 19, 2014

From @demerphq

On 19 February 2014 16​:23, Eric Brine <ikegami@​adaelis.com> wrote​:

On Wed, Feb 19, 2014 at 5​:48 AM, demerphq <demerphq@​gmail.com> wrote​:

On 19 February 2014 01​:46, Tony Cook via RT <perlbug-followup@​perl.org>
wrote​:

On Tue Feb 18 01​:04​:36 2014, efimov@​reg.ru wrote​:

proposed patch attached.

some related discussion​:
http​://www.nntp.perl.org/group/perl.unicode/1999/10/msg1.html

currently BOM does not seem to trigger 'use utf8' behaviour.
one possible reason to not to merge this patch is if we're going to
use BOM in the future.

A UTF-16 BOM is recognized though, and treats the source as unicode.

Personally I think if we support UTF-16 BOM we should support them all.

That would break our code. We have UTF-8 files, but we don't use C<< use
utf8; >>.

Do they have BOMs in them?

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented Feb 19, 2014

From @Hugmeir

On Wed, Feb 19, 2014 at 4​:48 PM, demerphq <demerphq@​gmail.com> wrote​:

On 19 February 2014 16​:23, Eric Brine <ikegami@​adaelis.com> wrote​:

On Wed, Feb 19, 2014 at 5​:48 AM, demerphq <demerphq@​gmail.com> wrote​:

On 19 February 2014 01​:46, Tony Cook via RT <perlbug-followup@​perl.org>
wrote​:

On Tue Feb 18 01​:04​:36 2014, efimov@​reg.ru wrote​:

proposed patch attached.

some related discussion​:
http​://www.nntp.perl.org/group/perl.unicode/1999/10/msg1.html

currently BOM does not seem to trigger 'use utf8' behaviour.
one possible reason to not to merge this patch is if we're going to
use BOM in the future.

A UTF-16 BOM is recognized though, and treats the source as unicode.

Personally I think if we support UTF-16 BOM we should support them all.

That would break our code. We have UTF-8 files, but we don't use C<< use
utf8; >>.

Do they have BOMs in them?

Corollary​: Why would you ever have a UTF-8 BOM?

p5pRT commented Feb 19, 2014

From @demerphq

On 19 February 2014 16​:35, Victor Efimov via RT
<perlbug-followup@​perl.org> wrote​:

On Wed Feb 19 02​:48​:40 2014, demerphq wrote​:

Personally I think if we support UTF-16 BOM we should support them
all.

Yves

I am opposed to this.

1) 'use utf8' behaviour change how program behaves. It's not just cosmetic/metadata.

2) How patches to add/remove BOM will look? How diff/git/github/IDEs/other tools display BOM ? Will patch to add 'use utf8' make sense if applied to file with BOM?

3) People will have to control BOM - they'll invent Test​::BOM, Test​::NoBOM etc to make sure their code not broken because someone commited BOM by accident.
similar things happen now with Tabs vs Spaces - everyone know when and why they should use tabs/spaces for a particular project, they know how to tune their IDE, but
they still commit tabs instead of spaces sometimes. Thing with BOM is much worse, because it leads to hidden bugs in application.

All of the above points apply equally to UTF-8 and UTF-16. If we
trigger "use utf8" because there is a UTF-16 BOM at the start of the
file then you need to explain why UTF-8 should be treated differently.
None of the above arguments do that.

4) How other programming languages control this? I know Ruby use special pragma "#encoding". They've choosen to not use BOM. What about others?

Why is this relevant? Why not look at how OS'es handle this? On
windows BOM's mean "unicode file", and most editors automatically
insert them. On linux BOM's are generally ignored or cause problems,
and most editors do not automatically insert them.

5) TemplateToolkit uses BOM for similar purpose. My observation that it's PITA. Everytime something breaks I have to
grep -rl $'\xEF\xBB\xBF' . to make sure it was not BOM.

Yes, TT is doing it right. This is what BOM's are for.

6) There are still people out of there who use UTF-8 files without 'use utf8' - they want utf-8 constants without flag.

So then they should not put a BOM on it.

7) Security issue? Someone compromissed a server, added BOM to a file, introduced security hole, and no one can't find it.

without supporting evidence that this actually happened I will ignore
this one as FUD.

Above does not apply to UTF-16, because UTF-8 and UTF-16 are different. UTF-8 is ASCII-compatible and UTF-16 - no. so

This doesn't make sense to me.

a) UTF-16 cannot live without BOM because it's not ASCII compatible. It needs BOM anyway.
b) UTF-16 needs BOM because there are LE and BE.

You could argue that all Unicode files need a BOM otherwise you can't
tell them from some other arbitrary encoding.

c) UTF-16 with wrong bom / missing BOM will not compile at all, most likely.

True, but then that is what I expect from Perl encountering a utf8 bom as well.

(altrought I think that even with UTF-16 BOM, BOM just should be ignored and text parsed as UTF-16 but without 'use utf8' behaviour).

IMO that doesn't make sense. The 'use utf8' behavior tells perl that
octet sequences are unicode and not binary. It would make no sense to
parse as UTF-16 but then not be able to create string constants....
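
The octets-versus-characters distinction can be seen at run time without any non-ASCII source at all. The sketch below is only an approximation of what the pragma does for literals at compile time: it uses `utf8::decode`, one of the utility functions that the utf8 documentation describes as usable without `use utf8;`, to turn the same two octets into one character.

```perl
use strict;
use warnings;

# Two octets: the UTF-8 encoding of U+00E9 (e with acute accent).
my $octets = "\xC3\xA9";

# Decode a copy in place; utf8::decode returns false on malformed input.
my $chars = $octets;
utf8::decode($chars) or die "not valid UTF-8";

printf "as octets:     length %d\n", length $octets;   # 2
printf "as characters: length %d\n", length $chars;    # 1
printf "code point:    U+%04X\n",    ord $chars;       # U+00E9
```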

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented Feb 19, 2014

From @demerphq

On 19 February 2014 16​:57, Brian Fraser <fraserbn@​gmail.com> wrote​:

On Wed, Feb 19, 2014 at 4​:48 PM, demerphq <demerphq@​gmail.com> wrote​:

On 19 February 2014 16​:23, Eric Brine <ikegami@​adaelis.com> wrote​:

On Wed, Feb 19, 2014 at 5​:48 AM, demerphq <demerphq@​gmail.com> wrote​:

On 19 February 2014 01​:46, Tony Cook via RT <perlbug-followup@​perl.org>
wrote​:

On Tue Feb 18 01​:04​:36 2014, efimov@​reg.ru wrote​:

proposed patch attached.

some related discussion​:
http​://www.nntp.perl.org/group/perl.unicode/1999/10/msg1.html

currently BOM does not seem to trigger 'use utf8' behaviour.
one possible reason to not to merge this patch is if we're going to
use BOM in the future.

A UTF-16 BOM is recognized though, and treats the source as unicode.

Personally I think if we support UTF-16 BOM we should support them all.

That would break our code. We have UTF-8 files, but we don't use C<< use
utf8; >>.

Do they have BOMs in them?

Corollary​: Why would you ever have a UTF-8 BOM?

As I recall Windows editors normally insert them when writing in utf8.
Also they could come from converting a UTF-16 file to UTF-8.

http​://en.wikipedia.org/wiki/Byte_order_mark

is pretty good for understanding these issues.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented Feb 19, 2014

From @Hugmeir

On Wed, Feb 19, 2014 at 5​:02 PM, demerphq <demerphq@​gmail.com> wrote​:

On 19 February 2014 16​:57, Brian Fraser <fraserbn@​gmail.com> wrote​:

On Wed, Feb 19, 2014 at 4​:48 PM, demerphq <demerphq@​gmail.com> wrote​:

On 19 February 2014 16​:23, Eric Brine <ikegami@​adaelis.com> wrote​:

On Wed, Feb 19, 2014 at 5​:48 AM, demerphq <demerphq@​gmail.com> wrote​:

On 19 February 2014 01​:46, Tony Cook via RT <
perlbug-followup@​perl.org>
wrote​:

On Tue Feb 18 01​:04​:36 2014, efimov@​reg.ru wrote​:

proposed patch attached.

some related discussion​:
http​://www.nntp.perl.org/group/perl.unicode/1999/10/msg1.html

currently BOM does not seem to trigger 'use utf8' behaviour.
one possible reason to not to merge this patch is if we're going
to
use BOM in the future.

A UTF-16 BOM is recognized though, and treats the source as
unicode.

Personally I think if we support UTF-16 BOM we should support them
all.

That would break our code. We have UTF-8 files, but we don't use C<<
use
utf8; >>.

Do they have BOMs in them?

Corollary​: Why would you ever have a UTF-8 BOM?

As I recall Windows editors normally insert them when writing in utf8.
Also they could come from converting a UTF-16 file to UTF-8.

http​://en.wikipedia.org/wiki/Byte_order_mark

is pretty good for understanding these issues.

Hm, I think that the question that I meant to ask was, Why would you ever
*want* a UTF-8 BOM? It strikes me that at best they're an annoyance added
by Windows editors.

p5pRT commented Feb 19, 2014

From victor@vsespb.ru

On Wed Feb 19 08​:01​:09 2014, demerphq wrote​:

All of the above points apply equally to UTF-8 and UTF-16. If we

No, I explained how UTF-8 differs from UTF-16: UTF-8 is ASCII-compatible.

Why is this relevant? Why not look at how OS'es handle this? On
windows BOM's mean "unicode file", and most editors automatically

Because Perl is a programming language (interpreter), not an OS or a text editor. So the only thing that is relevant is how other programming languages behave.

insert them. On linux BOM's are generally ignored or cause problems,
and most editors do not automatically insert them.

5) TemplateToolkit uses BOM for similar purpose. My observation that
it's PITA. Everytime something breaks I have to
grep -rl $'\xEF\xBB\xBF' . to make sure it was not BOM.

Yes, TT is doing it right. This is what BOM's are for.

No, the BOM is there to distinguish LE from BE. And a BOM for UTF-8 is something non-standard:

http​://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
The Unicode Standard permits the BOM in UTF-8,[2] but does not require nor recommend its use

6) There are still people out of there who use UTF-8 files without
'use utf8' - they want utf-8 constants without flag.

So then they should not put a BOM on it.

Again, I explained that there will be a mess just like with Tabs vs Spaces.

7) Security issue? Someone compromissed a server, added BOM to a
file, introduced security hole, and no one can't find it.

without supporting evidence that this actually happened I will ignore
this one as FUD.

Happened what? A UTF-8 BOM does not trigger anything in any released version of perl yet, so nothing could have happened.
The point is that a UTF-8 BOM is hard to spot by eye (this does not apply to a UTF-16 BOM). And if you look at code and cannot understand how it behaves without a hex editor, that can lead to security issues.

(altrought I think that even with UTF-16 BOM, BOM just should be
ignored and text parsed as UTF-16 but without 'use utf8' behaviour).

IMO that doesn't make sense. The 'use utf8' behavior tells perl that
octet sequences are unicode and not binary. It would make no sense to
parse as UTF-16 but then not be able to create string constants....

Agreed, it would be pretty strange to parse UTF-16, convert it to UTF-8 and use it without the UTF-8 flag. But allowing the file format
to change program behaviour is a bad idea anyway. Better to use explicit pragmas for this.

Yves

p5pRT commented Feb 19, 2014

From perl5-porters@perl.org

Yves Orton wrote​:

On 19 February 2014 16​:23, Eric Brine <ikegami@​adaelis.com> wrote​:

That would break our code. We have UTF-8 files, but we don't use C<< use
utf8; >>.

Do they have BOMs in them?

Some of mine do, and fit that description perfectly. It happens when
I find myself stuck in Windows world without tools I'm familiar with.
At least I can get things working. Having the BOM imply 'use utf8'
would make things harder.

p5pRT commented Feb 19, 2014

From @demerphq

On 19 February 2014 17​:16, Victor Efimov via RT
<perlbug-followup@​perl.org> wrote​:

On Wed Feb 19 08​:01​:09 2014, demerphq wrote​:

All of the above points apply equally to UTF-8 and UTF-16. If we

No, I explained how UTF-8 differs from UTF-16. UTF-8 is ascii compatible,

You keep saying that. And I keep thinking you must mean something
other than I do when I say "ascii compatible".

Think about squares and rectangles. Squares are a subset of
rectangles. But not all rectangles are squares.

UTF-8 is NOT ASCII compatible. You cannot take any arbitrary UTF-8
string and convert it to ASCII.

You can convert ASCII to UTF-8. And indeed ASCII is a subset of UTF-8.
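
Both senses of "ASCII compatible" used in this exchange can be shown with the core Encode module. This is only a sketch with an arbitrary sample string: pure ASCII text encodes to the same bytes in UTF-8 but not in UTF-16, while a non-ASCII character cannot be encoded as ASCII at all.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $ascii = "print 1;\n";

# Pure ASCII text is byte-for-byte identical once encoded as UTF-8 ...
my $as_utf8  = encode('UTF-8', $ascii);
# ... but not as UTF-16, where each of these characters takes two bytes.
my $as_utf16 = encode('UTF-16LE', $ascii);

printf "UTF-8 identical to ASCII:  %s\n", $as_utf8  eq $ascii ? 'yes' : 'no';
printf "UTF-16 identical to ASCII: %s\n", $as_utf16 eq $ascii ? 'yes' : 'no';

# The reverse direction fails: U+00E9 has no ASCII representation, so a
# strict encode croaks. LEAVE_SRC keeps Encode from modifying $e_acute.
my $e_acute = "\x{E9}";
my $ok = eval {
    encode('ascii', $e_acute, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
};
printf "U+00E9 encodable as ASCII: %s\n", $ok ? 'yes' : 'no';
```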

Why is this relevant? Why not look at how OS'es handle this? On
windows BOM's mean "unicode file", and most editors automatically

Because Perl is programming language (interpreter) and not OS, nor text editor. So only thing that is relevant is how other programming languages behave.

OK. Well, IMO most programming languages don't know anything about
Unicode. Which is hardly surprising considering most of them predate
Unicode.

insert them. On linux BOM's are generally ignored or cause problems,
and most editors do not automatically insert them.

5) TemplateToolkit uses BOM for similar purpose. My observation that
it's PITA. Everytime something breaks I have to
grep -rl $'\xEF\xBB\xBF' . to make sure it was not BOM.

Yes, TT is doing it right. This is what BOM's are for.

No, BOM is to distinct LE vs BE. And BOM for UTF-8 is something non-standard​:

http​://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
The Unicode Standard permits the BOM in UTF-8,[2] but does not require nor recommend its use

Sigh. I guess we have a different definition of "non-standard".
Something that the standard allows is not non-standard. It may be
irregular. But the standard allows it. So it's part of the standard,
thus it cannot be "non-standard". By definition.

6) There are still people out there who use UTF-8 files without
'use utf8' - they want UTF-8 constants without the flag.

So then they should not put a BOM on it.

Again, I explained that there will be a mess just like with Tabs vs Spaces.

I don't think that is relevant to a discussion about why we should
consistently apply the same rules to BOMs regardless of the encoding
format.

7) Security issue? Someone compromises a server, adds a BOM to a
file, introduces a security hole, and no one can find it.

Without supporting evidence that this actually happened I will ignore
this one as FUD.

What happened? A UTF-8 BOM does not trigger anything in any released version of perl yet, so nothing could have happened.
The point is that a UTF-8 BOM is hard to spot by eye (this does not apply to a UTF-16 BOM). And if you look at the code and cannot understand how it behaves without a hex editor, that can lead to security issues.

So it *is* FUD.

(although I think that even with a UTF-16 BOM, the BOM should just be
ignored and the text parsed as UTF-16, but without 'use utf8' behaviour).

IMO that doesn't make sense. The 'use utf8' behavior tells perl that
octet sequences are unicode and not binary. It would make no sense to
parse as UTF-16 but then not be able to create string constants....

Agreed, it would be pretty strange to parse UTF-16, convert it to UTF-8 and use it without the UTF-8 flag. But allowing the file format
to change program behaviour is a bad idea anyway. Better to use explicit pragmas for this.

I think we will just have to agree to disagree.

Yves

@p5pRT
Author

p5pRT commented Feb 19, 2014

From perl5-porters@perl.org

Yves Orton wrote​:

All of the above points apply equally to UTF-8 and UTF-16. If we
trigger "use utf8" because there is a UTF-16 BOM at the start of the
file then you need to explain why UTF-8 should be treated differently.
None of the above arguments do that.

Backward compatibility.

6) There are still people out there who use UTF-8 files without 'use utf8'
- they want UTF-8 constants without the flag.

So then they should not put a BOM on it.

So my scripts that already have BOMbs will start to behave
erratically?

@p5pRT
Author

p5pRT commented Feb 19, 2014

From @demerphq

On 19 February 2014 17​:26, Father Chrysostomos <sprout@​cpan.org> wrote​:

Yves Orton wrote​:

On 19 February 2014 16​:23, Eric Brine <ikegami@​adaelis.com> wrote​:

That would break our code. We have UTF-8 files, but we don't use C<< use
utf8; >>.

Do they have BOMs in them?

Some of mine do, and fit that description perfectly. It happens when
I find myself stuck in Windows world without tools I'm familiar with.
At least I can get things working. Having the BOM imply 'use utf8'
would make things harder..

Since you are the king of consistency I am a bit surprised. :-)

The core of the issue here is that if you took that file and naively
converted it from UTF-8 to UTF-16 it would behave differently.

IMO either we should respect BOM's or we should not. Which encoding is
involved UTF-7, UTF-8, UTF-16, UTF-32 should not matter.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Author

p5pRT commented Feb 19, 2014

From victor@vsespb.ru

On Wed Feb 19 08​:29​:05 2014, demerphq wrote​:

On 19 February 2014 17​:16, Victor Efimov via RT
<perlbug-followup@​perl.org> wrote​:

On Wed Feb 19 08​:01​:09 2014, demerphq wrote​:

All of the above points apply equally to UTF-8 and UTF-16. If we

No, I explained how UTF-8 differs from UTF-16. UTF-8 is ascii
compatible,

You keep saying that. And I keep thinking you must mean something
other than I do when I say "ascii compatible".

Think about squares and rectangles. Squares are a subset of
rectangles. But not all rectangles are squares.

UTF-8 is NOT ASCII compatible. You cannot take any arbitrary UTF-8
string and convert it to ASCII.

You can convert ASCII to UTF-8. And indeed ASCII is a subset of UTF-8.

http://www.cl.cam.ac.uk/~mgk25/ucs/man-utf-8.html
UTF-8 - an ASCII compatible multibyte Unicode encoding

http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
UTF-16 and UTF-32 are incompatible with ASCII files

In our case that means that perl won't compile UTF-16 files with a missing BOM, because Perl operators are all in ASCII and UTF-16 is not ASCII-compatible. Thus you can't mess things up with a UTF-16 BOM.
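For readers following the "ASCII compatible" disagreement, both senses used in this exchange can be checked at the byte level. The sketch below is language-neutral; Python is used only for its byte tooling, the sample string is an arbitrary illustration, and none of this models perl's own tokenizer.

```python
# A one-line script body made of ASCII characters only (illustrative sample).
source = 'print "Test\\n";'

ascii_bytes = source.encode("ascii")
utf8_bytes = source.encode("utf-8")
utf16le_bytes = source.encode("utf-16-le")

# Sense 1: ASCII text is byte-for-byte identical when encoded as UTF-8 ...
assert utf8_bytes == ascii_bytes
# ... but not when encoded as UTF-16, where each ASCII character
# is followed by a NUL byte.
assert utf16le_bytes != ascii_bytes
assert utf16le_bytes[:4] == b"p\x00r\x00"

# Sense 2: arbitrary UTF-8 text cannot be read back as ASCII.
try:
    "\u03a3".encode("utf-8").decode("ascii")  # Greek capital sigma
    decodes_as_ascii = True
except UnicodeDecodeError:
    decodes_as_ascii = False

print(decodes_as_ascii)
```

So ASCII-only source survives a move to UTF-8 unchanged, while the same source in UTF-16 is unreadable to an ASCII-oriented parser unless it is told (or detects) the encoding.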

Why is this relevant? Why not look at how OS'es handle this? On
windows BOM's mean "unicode file", and most editors automatically

Because Perl is a programming language (an interpreter), not an OS or a
text editor. So the only thing that is relevant is how other programming
languages behave.

OK. Well, IMO most programming languages don't know anything about
Unicode. Which is hardly surprising considering most of them predate
Unicode.

Ruby does. And it does not use a BOM.

There should be a success story of a programming language changing program behaviour because of a UTF-8 BOM. Otherwise it's risky.

As for TemplateToolkit - I have had HTML designers refuse to deal with BOMs. As a result we had no Unicode in our templates (we had UTF-8 without the UTF-8 flag). That just made the migration to Unicode harder.

No, the BOM is there to distinguish LE from BE. And a BOM for UTF-8 is
something non-standard:

http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
The Unicode Standard permits the BOM in UTF-8, but does not
require nor recommend its use

Sigh. I guess we have a different definition of "non-standard".
Something that the standard allows is not non-standard. It may be
irregular. But the standard allows it. So it's part of the standard,
thus it cannot be "non-standard". By definition.

ok, not recommended.

6) There are still people out there who use UTF-8 files without
'use utf8' - they want UTF-8 constants without the flag.

So then they should not put a BOM on it.

Again, I explained that there will be a mess just like with Tabs vs
Spaces.

I don't think that is relevant to a discussion about why we should
consistently apply the same rules to BOMs regardless of the encoding
format.

it's relevant imho.

What happened? A UTF-8 BOM does not trigger anything in any released
version of perl yet, so nothing could have happened.
The point is that a UTF-8 BOM is hard to spot by eye (this does not apply
to a UTF-16 BOM). And if you look at the code and cannot understand how it
behaves without a hex editor, that can lead to security issues.

So it *is* FUD.

It's not FUD. I explained why it can be a security issue. And you've just asked me to provide proof that there _was_ a security problem with a _not-yet-released_ feature? That *is* BS.

Also, here is an example:

my $s = <<"END";
Hello
END

print length $s;

It prints 6 with both LF and CRLF line endings. Program behaviour is not affected by the file format.

And now we're going to break this. Any code sent over email would need an attached note: "use without BOM" or "use with BOM".

@p5pRT
Author

p5pRT commented Feb 19, 2014

From @ikegami

On Wed, Feb 19, 2014 at 10​:57 AM, Brian Fraser <fraserbn@​gmail.com> wrote​:

On Wed, Feb 19, 2014 at 4​:48 PM, demerphq <demerphq@​gmail.com> wrote​:

On 19 February 2014 16​:23, Eric Brine <ikegami@​adaelis.com> wrote​:

On Wed, Feb 19, 2014 at 5​:48 AM, demerphq <demerphq@​gmail.com> wrote​:

On 19 February 2014 01​:46, Tony Cook via RT <perlbug-followup@​perl.org

wrote​:

On Tue Feb 18 01​:04​:36 2014, efimov@​reg.ru wrote​:

proposed patch attached.

some related discussion​:
http​://www.nntp.perl.org/group/perl.unicode/1999/10/msg1.html

currently BOM does not seem to trigger 'use utf8' behaviour.
one possible reason to not to merge this patch is if we're going to
use BOM in the future.

A UTF-16 BOM is recognized though, and treats the source as unicode.

Personally I think if we support UTF-16 BOM we should support them all.

That would break our code. We have UTF-8 files, but we don't use C<< use
utf8; >>.

Do they have BOMs in them?

Yes, so that UltraEdit detects them as UTF-8.

Corollary​: Why would you ever have a UTF-8 BOM?

To distinguish UTF-8 files from native files. This is particularly common
practice on Windows.

@p5pRT
Author

p5pRT commented Feb 19, 2014

From victor@vsespb.ru

On Wed Feb 19 07​:35​:49 2014, vsespb wrote​:

4) How do other programming languages control this? I know Ruby uses a
special pragma, "#encoding". They've chosen not to use the BOM. What about
others?

UPD:
I tried testing Ruby and Python.

1) ruby1.8 - does not support the UTF-8 BOM

2) ruby1.9 - does support it (together with the "encoding" pseudo-comment).
However, without a BOM and without "encoding", ruby complains: "invalid multibyte char (US-ASCII)"

So you cannot use the wrong encoding for a file. You cannot change program behaviour by adding or removing a BOM (unlike Perl).

(Note that if the file is pure ASCII, you can actually use it with or without a BOM, and the string constant encoding will differ.
This difference, however, should not affect how the program behaves (because the data is ASCII-only), but it probably can in edge cases and in case of bugs (bugs like The Unicode Bug in perl).)

3) python 2.7.3 - supports both the "encoding" pseudo-comment and the UTF-8 BOM.
However, just like ruby, it complains "SyntaxError: Non-ASCII character '\xd0' in file " if you use non-ASCII in a file without a BOM.
So, again, you can't change program behaviour with a wrong BOM (unlike Perl).

(Note that I don't know Python at all; I might be missing something.)

4) python does not support UTF-16
http://www.python.org/dev/peps/pep-0263/
"It does not include encodings which use two or more bytes for all characters like e.g. UTF-16. The reason for this is to keep the encoding
detection algorithm in the tokenizer simple."

@p5pRT
Author

p5pRT commented Feb 19, 2014

From @Leont

On Wed, Feb 19, 2014 at 11​:48 AM, demerphq <demerphq@​gmail.com> wrote​:

Personally I think if we support UTF-16 BOM we should support them all.

No, we should not.

A BOM was never intended as an encoding marker. It's intended as a byte
order mark. UTF-8 does not have a byte order, so using a byte order mark is
just silly and wrong, and probably mostly a result of dumb UTF-16 to UTF-8
conversion (see CESU-8 for even worse idiocy). A UTF-32LE BOM is also a
valid UTF-16LE BOM. A UCS-2 BOM is by definition identical to a UTF-16 BOM.
Not to mention the issue that they're valid ISO-8859-$any sequences too.
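Two of the ambiguities mentioned above are easy to confirm from the BOM byte sequences themselves. This is a minimal sketch using the constants in Python's standard `codecs` module; the variable names are illustrative only, and nothing here reflects how perl itself sniffs encodings.

```python
import codecs

# The UTF-32LE BOM is FF FE 00 00 and the UTF-16LE BOM is FF FE,
# so the first two bytes alone cannot tell the two apart.
utf32le_starts_with_utf16le = codecs.BOM_UTF32_LE.startswith(codecs.BOM_UTF16_LE)

# Read as UTF-16LE, the rest of the UTF-32LE BOM is just a NUL character.
remainder = codecs.BOM_UTF32_LE[len(codecs.BOM_UTF16_LE):]
remainder_as_utf16le = remainder.decode("utf-16-le")

# The UTF-8 BOM bytes EF BB BF are also three valid ISO-8859-1 characters.
utf8_bom_as_latin1 = codecs.BOM_UTF8.decode("iso-8859-1")

print(utf32le_starts_with_utf16le, repr(remainder_as_utf16le), repr(utf8_bom_as_latin1))
```

A detector that looks only at the leading bytes therefore has to pick a precedence among these readings; the bytes do not settle it.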

Actually, I'm not sure UTF-16 support was a good idea either. Given it
doesn't play all that nice with a shebang I can't imagine anyone using it
on unix for purposes other than obfuscation. I can vaguely imagine people
using it on Windows if they don't know how to set their editor to a proper
encoding, but I see absolutely no advantages. I'm rather curious how many
edge-cases are handled usefully, such as __DATA__, POD and binary literals.

Leon

@p5pRT
Author

p5pRT commented Feb 20, 2014

From perl5-porters@perl.org

Yves Orton​:

Since you are the king of consistency I am a bit surprised. :-)

The core of the issue here is that if you took that file and naively
converted it from UTF-8 to UTF-16 it would behave differently.

But I never intend to do that! :-)

Honestly, I don't care how UTF-16 files are treated.

@p5pRT
Author

p5pRT commented Feb 21, 2014

From @tonycoz

On Wed Feb 19 03​:56​:08 2014, efimov@​reg.ru wrote​:

2014-02-19 4​:46 GMT+04​:00 Tony Cook via RT <perlbug-followup@​perl.org>​:

A UTF-16 BOM is recognized though, and treats the source as unicode.

yes. indeed. starting from ~ 5.12.

I wonder if this should be documented in 'utf8' or not:
- UTF-16 is not UTF-8
- there is no 'use' equivalent for UTF-16 BOM.
maybe perlrun is a better place?

Turns out perl recognizes a UTF-8 BOM but merely skips it, despite this from perlunicode.pod:

=item C<BOM>-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode C<BOM> (UTF-16LE, UTF16-BE,
or UTF-8), or if the script looks like non-C<BOM>-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(C<BOM>less UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

The source doesn't appear to be treated as Unicode:

tony@​mars​:.../git/perl2$ hd test.pl
00000000 ef bb bf 70 72 69 6e 74 20 22 54 65 73 74 5c 6e |...print "Test\n|
00000010 22 3b 0a 70 72 69 6e 74 20 6f 72 64 28 27 ce a3 |";.print ord('..|
00000020 27 29 2c 20 22 5c 6e 22 3b 0a |'), "\n";.|
0000002a
You have mail in /var/mail/tony
tony@​mars​:.../git/perl2$ ./perl test.pl
Test
206
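The 206 is what one would expect if the literal is read as raw bytes rather than as a decoded character. A minimal sketch of the arithmetic (Python is used only for the byte handling; the names are illustrative), taking the two bytes `ce a3` that sit between the quotes in the hexdump:

```python
# Bytes of the string literal inside ord('...') in test.pl, per the hexdump.
literal = b"\xce\xa3"

# Treated as bytes, the first "character" is the byte 0xCE.
first_byte = literal[0]

# Treated as UTF-8 text, the two bytes form one character, U+03A3.
code_point = ord(literal.decode("utf-8"))

print(first_byte, code_point)  # 206 931
```

An output of 931 would indicate the source was being decoded as UTF-8; 206 indicates it was not.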

So, except for the above, the behaviour appears to be documented in perlunicode, which is one of the logical places to look for it.

I don't think it needs further documentation in perlrun.

As to support for other BOMs - that belongs in a new ticket.

Tony

@khwilliamson
Contributor

So, what should be done with this ticket?

@xenu xenu removed the Severity Low label Dec 29, 2021