Anomalies in handling malformed utf8 input #16504

Closed
p5pRT opened this issue Apr 11, 2018 · 15 comments

p5pRT commented Apr 11, 2018

Migrated from rt.perl.org#133101 (status was 'rejected')

Searchable as RT133101$

p5pRT commented Apr 11, 2018

From @rjbs

Mark Dominus sent me a bug report that he couldn't get perlbug to accept.


This is a bug report.

The attached input file “bad” is a one-line summary of an email message
whose subject field was malformed. The subject field is encoded in GB-2312
and its raw bytes are invalid when interpreted as utf8. Let us suppose
that this data is saved in a file named bad. Now consider the following
invocations:

1$ perl -lne 'print if /[ąę]/' bad > /dev/null
2$ PERL_UNICODE=39 perl -lne 'print if /[ąę]/' bad > /dev/null
3$ cat bad | perl -lne 'print if /[ąę]/' > /dev/null
4$ cat bad | PERL_UNICODE=39 perl -lne 'print if /[ąę]/' > /dev/null
Malformed UTF-8 character (fatal) at -e line 1, <> line 1.

5$ perl -lne 'print if /ą/' bad > /dev/null
6$ PERL_UNICODE=39 perl -lne 'print if /ą/' bad > /dev/null
7$ cat bad | perl -lne 'print if /ą/' > /dev/null
8$ cat bad | PERL_UNICODE=39 perl -lne 'print if /ą/' > /dev/null

There are at least two anomalies here.

Invocation 4 properly fails. (PERL_UNICODE=39 is equivalent to supplying
the -CAS flag to Perl.) But invocation 8 is identical, except that the
pattern is /ą/ instead of /[ąę]/; why doesn't this fail as well?

Invocation 2 is identical to invocation 4, except that in invocation 4 the data is delivered on
stdin rather than coming from ARGV. Why doesn't invocation 2 fail as well? (The
data itself is identical, as confirmed by cat bad | cmp - bad.)
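
For anyone who wants to experiment without the attachment, a stand-in input can be sketched as below. The bytes are an assumption (a few high bytes of the kind a GB-2312 subject line would contain), not the contents of the attached file, and the temporary path is hypothetical. The check relies on utf8::decode returning false when its argument is not well-formed UTF-8.

```shell
#!/bin/sh
# Hypothetical stand-in for the attached "bad" file: bytes D3 D0 D0 A7,
# written with octal escapes so that any POSIX printf accepts them.
tmpdir=$(mktemp -d)
printf '    6  01/02 \323\320\320\247\n' > "$tmpdir/bad"

# utf8::decode returns false when the bytes are not well-formed UTF-8.
if perl -ne 'exit(utf8::decode($_) ? 0 : 1)' "$tmpdir/bad"; then
    verdict="valid UTF-8"
else
    verdict="malformed UTF-8"
fi
echo "$verdict"
rm -rf "$tmpdir"
```

D3 is a two-byte lead byte and D0 is not a continuation byte, so the first sequence is already malformed.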

The complete message header is also attached (msg-hdr.txt), and the
examples above all behave the same when I use it in place of the shorter
excerpt.

This is perl 5, version 22, subversion 1 (v5.22.1) built for
x86_64-linux-gnu-thread-multi
(with 60 registered patches, see the attached output of perl -V for more
detail)

Please cc me on replies, as I do not regularly read this list.

--
rjbs

p5pRT commented Apr 11, 2018

From @rjbs

  6 01/02 " ������Ч�Ŀ�չ�ֳ����ճ�����-����վ<<������​:�����

p5pRT commented Apr 11, 2018

From @rjbs

Summary of my perl5 (revision 5 version 22 subversion 1) configuration​:
 
  Platform​:
  osname=linux, osvers=3.16.0, archname=x86_64-linux-gnu-thread-multi
  uname='linux localhost 3.16.0 #1 smp debian 3.16.0 x86_64 gnulinux '
  config_args='-Dusethreads -Duselargefiles -Dcc=x86_64-linux-gnu-gcc -Dcpp=x86_64-linux-gnu-cpp -Dld=x86_64-linux-gnu-gcc -Dccflags=-DDEBIAN -Wdate-time -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Dldflags= -Wl,-Bsymbolic-functions -Wl,-z,relro -Dlddlflags=-shared -Wl,-Bsymbolic-functions -Wl,-z,relro -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.22 -Darchlib=/usr/lib/x86_64-linux-gnu/perl/5.22 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/x86_64-linux-gnu/perl5/5.22 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.22.1 -Dsitearch=/usr/local/lib/x86_64-linux-gnu/perl/5.22.1 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Duse64bitint -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -Uversiononly -DDEBUGGING=-g -Doptimize=-O2 -dEs -Duseshrplib -Dlibperl=libperl.so.5.22.1'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=define, usemultiplicity=define
  use64bitint=define, use64bitall=define, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='x86_64-linux-gnu-gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O2 -g',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include'
  ccversion='', gccversion='5.4.0 20160609', gccosandvers=''
  intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678, doublekind=3
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16, longdblkind=3
  ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries​:
  ld='x86_64-linux-gnu-gcc', ldflags =' -fstack-protector-strong -L/usr/local/lib'
  libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/5/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
  libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
  perllibs=-ldl -lm -lpthread -lc -lcrypt
  libc=libc-2.23.so, so=so, useshrplib=true, libperl=libperl.so.5.22
  gnulibc_version='2.23'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib -fstack-protector-strong'

Characteristics of this binary (from libperl)​:
  Compile-time options​: HAS_TIMES MULTIPLICITY PERLIO_LAYERS
  PERL_DONT_CREATE_GVSV
  PERL_HASH_FUNC_ONE_AT_A_TIME_HARD
  PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP
  PERL_NEW_COPY_ON_WRITE PERL_PRESERVE_IVUV
  USE_64_BIT_ALL USE_64_BIT_INT USE_ITHREADS
  USE_LARGE_FILES USE_LOCALE USE_LOCALE_COLLATE
  USE_LOCALE_CTYPE USE_LOCALE_NUMERIC USE_LOCALE_TIME
  USE_PERLIO USE_PERL_ATOF USE_REENTRANT_API
  Locally applied patches​:
  DEBPKG​:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN.
  DEBPKG​:debian/db_file_ver - http​://bugs.debian.org/340047 Remove overly restrictive DB_File version check.
  DEBPKG​:debian/doc_info - Replace generic man(1) instructions with Debian-specific information.
  DEBPKG​:debian/enc2xs_inc - http​://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @​INC directories.
  DEBPKG​:debian/errno_ver - http​://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes.
  DEBPKG​:debian/libperl_embed_doc - http​://bugs.debian.org/186778 Note that libperl-dev package is required for embedded linking
  DEBPKG​:fixes/respect_umask - Respect umask during installation
  DEBPKG​:debian/writable_site_dirs - Set umask approproately for site install directories
  DEBPKG​:debian/extutils_set_libperl_path - EU​:MM​: set location of libperl.a under /usr/lib
  DEBPKG​:debian/no_packlist_perllocal - Don't install .packlist or perllocal.pod for perl or vendor
  DEBPKG​:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets.
  DEBPKG​:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor.
  DEBPKG​:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy.
  DEBPKG​:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable.
  DEBPKG​:debian/mod_paths - Tweak @​INC ordering for Debian
  DEBPKG​:debian/prune_libs - http​://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need.
  DEBPKG​:fixes/net_smtp_docs - [rt.cpan.org #36038] http​://bugs.debian.org/100195 Document the Net​::SMTP 'Port' option
  DEBPKG​:debian/perlivp - http​://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local
  DEBPKG​:debian/deprecate-with-apt - http​://bugs.debian.org/747628 Point users to Debian packages of deprecated core modules
  DEBPKG​:debian/squelch-locale-warnings - http​://bugs.debian.org/508764 Squelch locale warnings in Debian package maintainer scripts
  DEBPKG​:debian/skip-upstream-git-tests - Skip tests specific to the upstream Git repository
  DEBPKG​:debian/patchlevel - http​://bugs.debian.org/567489 List packaged patches for 5.22.1-9ubuntu0.2 in patchlevel.h
  DEBPKG​:debian/skip-kfreebsd-crash - http​://bugs.debian.org/628493 [perl #96272] Skip a crashing test case in t/op/threads.t on GNU/kFreeBSD
  DEBPKG​:fixes/document_makemaker_ccflags - http​://bugs.debian.org/628522 [rt.cpan.org #68613] Document that CCFLAGS should include $Config{ccflags}
  DEBPKG​:debian/find_html2text - http​://bugs.debian.org/640479 Configure CPAN​::Distribution with correct name of html2text
  DEBPKG​:debian/perl5db-x-terminal-emulator.patch - http​://bugs.debian.org/668490 Invoke x-terminal-emulator rather than xterm in perl5db.pl
  DEBPKG​:debian/cpan-missing-site-dirs - http​://bugs.debian.org/688842 Fix CPAN​::FirstTime defaults with nonexisting site dirs if a parent is writable
  DEBPKG​:fixes/memoize_storable_nstore - [rt.cpan.org #77790] http​://bugs.debian.org/587650 Memoize​::Storable​: respect 'nstore' option not respected
  DEBPKG​:debian/regen-skip - Skip a regeneration check in unrelated git repositories
  DEBPKG​:debian/makemaker-pasthru - http​://bugs.debian.org/758471 Pass LD settings through to subdirectories
  DEBPKG​:fixes/pod_man_reproducible_date - http​://bugs.debian.org/759405 Support POD_MAN_DATE in Pod​::Man for the left-hand footer
  DEBPKG​:debian/locale-robustness - http​://bugs.debian.org/782068 [perl #124310] Make t/run/locale.t survive missing locales masked by LC_ALL
  DEBPKG​:fixes/podman-utc - http​://bugs.debian.org/780259 Make the embedded date from Pod​::Man reproducible
  DEBPKG​:fixes/podman-utc-docs - http​://bugs.debian.org/780259 Documentation and test suite updates for UTC fix
  DEBPKG​:fixes/podman-empty-date - http​://bugs.debian.org/780259 Support an empty POD_MAN_DATE environment variable
  DEBPKG​:fixes/podman-pipe - http​://bugs.debian.org/777405 Better errors for man pages from standard input
  DEBPKG​:debian/pod2man-customized - Update porting/customized.dat for pod2man modifications
  DEBPKG​:debian/makemaker-manext - http​://bugs.debian.org/247370 Make EU​::MakeMaker honour MANnEXT settings in generated manpage headers
  DEBPKG​:debian/makemaker_customized - Update t/porting/customized.dat for files patched in Debian
  DEBPKG​:debian/do-not-record-build-date - [6baa8db] http​://bugs.debian.org/774422 [perl #125830] Allow overriding the compile time in "perl -V" output
  DEBPKG​:fixes/podman-source-date-epoch - http​://bugs.debian.org/801621 Make Pod​::Man honor the SOURCE_DATE_EPOCH environment variable
  DEBPKG​:fixes/podman-source-date-epoch-cleanups - http​://bugs.debian.org/801621 Coding style and documentation for SOURCE_EPOCH_DATE
  DEBPKG​:fixes/podman-source-date-epoch-testfix - http​://bugs.debian.org/807086 Guard for building with SOURCE_DATE_EPOCH or POD_MAN_DATE set
  DEBPKG​:debian/devel-ppport-reproducibility - http​://bugs.debian.org/801523 Sort the list of XS code files when generating RealPPPort.xs
  DEBPKG​:fixes/encode-unicode-bom - http​://bugs.debian.org/798727 [rt.cpan.org #107043] Address https://rt.cpan.org/Public/Bug/Display.html?id=107043
  DEBPKG​:debian/encode-unicode-bom-doc - http​://bugs.debian.org/798727 Document Debian backport of Encode​::Unicode fix
  DEBPKG​:debian/kfreebsd-softupdates - http​://bugs.debian.org/796798 Work around Debian Bug#796798
  DEBPKG​:fixes/autodie-scope - http​://bugs.debian.org/798096 Fix a scoping issue with "no autodie" and the "system" sub
  DEBPKG​:debian/debugperl-compat-fix - [perl #127212] http​://bugs.debian.org/810326 Disable PERL_TRACK_MEMPOOL for debugging builds
  DEBPKG​:fixes/CVE-2015-8607_file_spec_taint_fix - http​://bugs.debian.org/810719 [perl #126862] ensure File​::Spec​::canonpath() preserves taint
  DEBPKG​:fixes/mkstemp-umask - http​://bugs.debian.org/810924 [perl #127322] [e57270b] Fix umask for mkstemp(3) calls
  DEBPKG​:fixes/crosscompile-no-targethost - [perl #127234] Fix the Configure escape with usecrosscompile but no targethost
  DEBPKG​:fixes/podlators-no-encode - [rt.cpan.org #111156] Degrade gracefully if utf8 is requested but Encode is not available
  DEBPKG​:debian/cross-time-hires - [rt.cpan.org #111391] Add an environment variable to skip running configuration probes
  DEBPKG​:fixes/encode-unicode-pod - Unicode.pm​: Fix POD error
  DEBPKG​:fixes/memoize-pod - [rt.cpan.org #89441] Fix POD errors in Memoize
  DEBPKG​:fixes/ok-pod - Added encoding for pod.
  DEBPKG​:fixes/CVE-2016-2381_duplicate_env - remove duplicate environment variables from environ
  DEBPKG​:fixes/CVE-2017-12837.patch - [PATCH] regcomp [perl #131582]
  DEBPKG​:fixes/CVE-2017-12883.patch - [PATCH] PATCH​: [perl #131598]
  Built under linux
  Compiled at Nov 10 2017 14​:39​:06
  @​INC​:
  /etc/perl
  /usr/local/lib/x86_64-linux-gnu/perl/5.22.1
  /usr/local/share/perl/5.22.1
  /usr/lib/x86_64-linux-gnu/perl5/5.22
  /usr/share/perl5
  /usr/lib/x86_64-linux-gnu/perl/5.22
  /usr/share/perl/5.22
  /usr/local/lib/site_perl
  /usr/lib/x86_64-linux-gnu/perl-base
  .

p5pRT commented Apr 11, 2018

From @rjbs

Return-Path​: <yjzdjae3560@​ywart.com>
Delivered-To​: mjd-deliver-ham@​plover.com
Received​: (qmail 16384 invoked by uid 119); 1 Jan 2017 20​:09​:13 -0000
Delivered-To​: mjd-postspamc@​plover.com
Received​: (qmail 16337 invoked by uid 119); 1 Jan 2017 20​:09​:11 -0000
Delivered-To​: mjd@​plover.com
Received​: (qmail 16323 invoked by uid 119); 1 Jan 2017 20​:09​:11 -0000
Delivered-To​: mjd-perl-patch@​plover.com
Received​: (qmail 16294 invoked from network); 1 Jan 2017 20​:08​:53 -0000
Received​: from unknown (HELO 0C6CC33CC389406.yinksoft.com) (27.20.38.203)
  by plover.com with SMTP; 1 Jan 2017 20​:08​:53 -0000
Date​: Mon, 2 Jan 2017 04​:08​:53 +0800
Subject​: ������Ч�Ŀ�չ�ֳ����ճ�����-����վ
From​: "�㶫���鼯��" <yjzdjae3560@​ywart.com>
Reply-To​: "�㶫���鼯��" <yjzdjae3560@​ywart.com>
To​: <mjd-perl-patch@​plover.com>
Message-ID​: <SAK20170102$43E5EC63.$68856FF6@​ywart.com>
Content-Type​: multipart/alternative;
  boundary="----=_SAKbound_04_0853_20170102_5E52AB54.7EA16EAB"
Content-Transfer-Encoding​: quoted-printable
X-Priority​: 3
Author​: yinksoft

p5pRT commented Apr 11, 2018

From @khwilliamson

Shouldn't `use utf8` be used?


p5pRT commented Apr 11, 2018

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Apr 11, 2018

From @Grinnz

On Wed, Apr 11, 2018 at 4​:46 PM, Karl Williamson <public@​khwilliamson.com>
wrote​:

Shouldn't `use utf8` be used?

Yes. -CAS only makes @ARGV be interpreted as UTF-8 and puts :utf8 layers on
STDIN/STDOUT/STDERR. The source code still needs `use utf8;` to be
interpreted correctly.

-Dan

p5pRT commented Apr 11, 2018

From @Grinnz

Using the options -CSD (-CD makes the special ARGV handle used by -n open
the passed filename with :utf8, -CS interprets the STDIN with :utf8) and
-Mutf8 (for the source code passed to -e) should make these examples
function as expected.

-Dan

p5pRT commented Apr 12, 2018

From @rjbs

On Wed, 11 Apr 2018 14​:35​:20 -0700, grinnz@​gmail.com wrote​:

Using the options -CSD (-CD makes the special ARGV handle used by -n open
the passed filename with :utf8, -CS interprets the STDIN with :utf8) and
-Mutf8 (for the source code passed to -e) should make these examples
function as expected.

-Dan

I'm not sure this is sufficient explanation. Consider​:

  ~$ cat bad | perl -CAS -Mutf8 -lne 'print if /ąę/'
  ~$ cat bad | perl -CAS -Mutf8 -lne 'print if /[ąę]/'
  Malformed UTF-8 character (fatal) at -e line 1, <> line 1.

Our input comes from stdin, and we have used -CS, which means STDIN is assumed UTF-8. In both cases, we use -Mutf8. We only see a fatal error in the second case, when we have used a character class instead of a string.

--
rjbs

p5pRT commented Apr 12, 2018

From @khwilliamson

I'm not sure the file survived the email transfer intact, because I
saved it and get a bunch of REPLACEMENT CHARACTERS, and so can't
reproduce it.

But I know the reason one fails and the other doesn't. Perl does not
currently examine its input for utf8 validity unless the proper layer is
used, which this isn't. That is a source of frustration to both rjbs
and me.

We also don't go out of our way to make validity checks as we execute.
Those checks are only done if the result somehow depends on them. If we
can, for example, fail a match without needing to know the UTF-8
validity of the target string, we do so, without slowing down everything
while we check, perhaps for the umpteenth time, that the string is valid.

That is what is happening here, as you can see if you add -Dr. As an
aside, that is the first thing an experienced perl programmer should do
when thinking there is a regex bug.

In the first case, you get this:
UTF-8 pattern and string...
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [0..146] gave -1
  Did not find anchored substr "%x{105}%x{119}"...
Match rejected by optimizer

In this case, we can tell that the match will fail because we first use
a fast Boyer-Moore scan for the 4-byte sequence that comprises the needed
string. It wasn't there, so no need to look in more detail.

The second case is different.
I get (with my sanitized input)
UTF-8 pattern and string...
Matching stclass ANYOF[0105 0119] against " 6 01/02 %"
  %x{fffd}%x{fffd}%x{fffd}"... (146 bytes)
Contradicts stclass... [regexec_flags]
Match failed

In this case we don't do a byte scan, but have to examine the string in
detail, and during that discover that it is malformed.

The fix for this is to fix :utf8 to do validity checking by default.
We're not going to cripple perl's performance by adding validity checks
where the outcome doesn't depend on validity. And we're not going to
make the code more complex by deciding, here we may be able to ignore
that it's invalid, and press on. To prevent segfaults and stuff, we
have to refuse to handle invalid utf-8 when it matters.

p5pRT commented Apr 12, 2018

From @khwilliamson

I believe it's documented somewhere that you can have inconsistent
results with invalid UTF-8 input.

p5pRT commented Apr 12, 2018

From @Grinnz

On Thu, Apr 12, 2018 at 10​:10 AM, Ricardo SIGNES via RT <
perlbug-followup@​perl.org> wrote​:

On Wed, 11 Apr 2018 14​:35​:20 -0700, grinnz@​gmail.com wrote​:

Using the options -CSD (-CD makes the special ARGV handle used by -n open
the passed filename with :utf8, -CS interprets the STDIN with :utf8) and
-Mutf8 (for the source code passed to -e) should make these examples
function as expected.

-Dan

I'm not sure this is sufficient explanation.

To clarify, I meant "this should make these examples exhibit the expected
bugs/odd behavior." :)

-Dan

p5pRT commented May 19, 2018

From @khwilliamson

I believe this ticket can be rejected, and will do so if I don't hear opinions to the contrary in the next month or so.
--
Karl Williamson

p5pRT commented Jul 16, 2018

From @khwilliamson

Two months without comment, so I am rejecting as scheduled
--
Karl Williamson

p5pRT commented Jul 16, 2018

@khwilliamson - Status changed from 'open' to 'rejected'
