Report information
Id: 126310
Status: resolved
Priority: 0/
Queue: perl5

Owner: khw <khw [at] cpan.org>
Requestors: florian.schlichting [at] fu-berlin.de
Cc:
AdminCc:

Operating System: (no value)
PatchStatus: (no value)
Severity: low
Type: unknown
Perl Version: (no value)
Fixed In: (no value)

Attachments
0001-Proof-of-concept-to-test-input-for-valid-UTF-8.patch



To: perlbug [...] perl.org
Date: Fri, 09 Oct 2015 15:17:40 +0200
From: florian.schlichting [...] fu-berlin.de
Subject: no "Malformed UTF-8 character" warning on single-quoted strings under "use utf8"
This is a bug report for perl from florian.schlichting@fu-berlin.de,
generated with the help of perlbug 1.40 running under perl 5.20.2.

-----------------------------------------------------------------

As discovered in the "Malformed UTF-8 character" thread at
http://www.perlmonks.org/?node_id=902060 and isolated by tchrist in a reply at
http://www.perlmonks.org/?displaytype=print;node_id=902212;replies=1, Perl
fails to issue a "Malformed UTF-8 character" warning when running under "use
utf8" IF the string in question is enclosed in single quotes. For double quoted
strings the warning is issued as expected:

    % blead -C0 -le 'print qq(print "\xB0C";)' | blead -Mutf8 -CS -l
    Malformed UTF-8 character (unexpected continuation byte 0xb0, with no preceding start byte) at - line 1.
    C

    % blead -C0 -le 'print qq(print \x27\xB0C\x27;)' | blead -Mutf8 -CS -l
    #C

This should be fixed so that the warning is issued for single quoted strings as
well, helping to detect incompletely/incorrectly converted scripts.

-----------------------------------------------------------------
---
Flags:
    category=core
    severity=medium
---
Site configuration information for perl 5.20.2:

Configured by Debian Project at Sun May 3 16:16:25 UTC 2015.

Summary of my perl5 (revision 5 version 20 subversion 2) configuration:

  Platform:
    osname=linux, osvers=3.2.0-4-amd64, archname=x86_64-linux-gnu-thread-multi
    uname='linux x86-csail-01 3.2.0-4-amd64 #1 smp debian 3.2.68-1+deb7u1 x86_64 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Dldflags= -Wl,-z,relro -Dlddlflags=-shared -Wl,-z,relro -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.20 -Darchlib=/usr/lib/x86_64-linux-gnu/perl/5.20 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/x86_64-linux-gnu/perl5/5.20 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.20.2 -Dsitearch=/usr/local/lib/x86_64-linux-gnu/perl/5.20.2 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Duse64bitint -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -Uversiononly -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.20.2 -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.9.2', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/4.9/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=libc-2.19.so, so=so, useshrplib=true, libperl=libperl.so.5.20
    gnulibc_version='2.19'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib -fstack-protector'

Locally applied patches:
    DEBPKG:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN.
    DEBPKG:debian/db_file_ver - http://bugs.debian.org/340047 Remove overly restrictive DB_File version check.
    DEBPKG:debian/doc_info - Replace generic man(1) instructions with Debian-specific information.
    DEBPKG:debian/enc2xs_inc - http://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @INC directories.
    DEBPKG:debian/errno_ver - http://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes.
    DEBPKG:debian/libperl_embed_doc - http://bugs.debian.org/186778 Note that libperl-dev package is required for embedded linking
    DEBPKG:fixes/respect_umask - Respect umask during installation
    DEBPKG:debian/writable_site_dirs - Set umask approproately for site install directories
    DEBPKG:debian/extutils_set_libperl_path - EU:MM: set location of libperl.a under /usr/lib
    DEBPKG:debian/no_packlist_perllocal - Don't install .packlist or perllocal.pod for perl or vendor
    DEBPKG:debian/prefix_changes - Fiddle with *PREFIX and variables written to the makefile
    DEBPKG:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets.
    DEBPKG:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor.
    DEBPKG:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy.
    DEBPKG:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable.
    DEBPKG:debian/mod_paths - Tweak @INC ordering for Debian
    DEBPKG:debian/module_build_man_extensions - http://bugs.debian.org/479460 Adjust Module::Build manual page extensions for the Debian Perl policy
    DEBPKG:debian/prune_libs - http://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need.
    DEBPKG:fixes/net_smtp_docs - [rt.cpan.org #36038] http://bugs.debian.org/100195 Document the Net::SMTP 'Port' option
    DEBPKG:debian/perlivp - http://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local
    DEBPKG:debian/deprecate-with-apt - http://bugs.debian.org/747628 Point users to Debian packages of deprecated core modules
    DEBPKG:debian/squelch-locale-warnings - http://bugs.debian.org/508764 Squelch locale warnings in Debian package maintainer scripts
    DEBPKG:debian/skip-upstream-git-tests - Skip tests specific to the upstream Git repository
    DEBPKG:debian/patchlevel - http://bugs.debian.org/567489 List packaged patches for 5.20.2-3+deb8u1 in patchlevel.h
    DEBPKG:debian/skip-kfreebsd-crash - http://bugs.debian.org/628493 [perl #96272] Skip a crashing test case in t/op/threads.t on GNU/kFreeBSD
    DEBPKG:fixes/document_makemaker_ccflags - http://bugs.debian.org/628522 [rt.cpan.org #68613] Document that CCFLAGS should include $Config{ccflags}
    DEBPKG:debian/find_html2text - http://bugs.debian.org/640479 Configure CPAN::Distribution with correct name of html2text
    DEBPKG:debian/perl5db-x-terminal-emulator.patch - http://bugs.debian.org/668490 Invoke x-terminal-emulator rather than xterm in perl5db.pl
    DEBPKG:debian/cpan-missing-site-dirs - http://bugs.debian.org/688842 Fix CPAN::FirstTime defaults with nonexisting site dirs if a parent is writable
    DEBPKG:fixes/memoize_storable_nstore - [rt.cpan.org #77790] http://bugs.debian.org/587650 Memoize::Storable: respect 'nstore' option not respected
    DEBPKG:debian/regen-skip - Skip a regeneration check in unrelated git repositories
    DEBPKG:fixes/regcomp-mips-optim - [perl #122817] http://bugs.debian.org/754054 Downgrade the optimization of regcomp.c on mips and mipsel due to a gcc-4.9 bug
    DEBPKG:debian/makemaker-pasthru - http://bugs.debian.org/758471 Pass LD settings through to subdirectories
    DEBPKG:fixes/perldoc-less-R - [rt.cpan.org #98636] http://bugs.debian.org/758689 Tell the 'less' pager to allow terminal escape sequences
    DEBPKG:fixes/pod_man_reproducible_date - http://bugs.debian.org/759405 Support POD_MAN_DATE in Pod::Man for the left-hand footer
    DEBPKG:fixes/io_uncompress_gunzip_inmemory - http://bugs.debian.org/747363 [rt.cpan.org #95494] Fix gunzip to in-memory file handle
    DEBPKG:fixes/socket_test_recv_fix - http://bugs.debian.org/758718 [perl #122657] Compare recv return value to peername in socket test
    DEBPKG:fixes/hurd_socket_recv_todo - http://bugs.debian.org/758718 [perl #122657] TODO checking the result of recv() on hurd
    DEBPKG:fixes/regexp-performance - [0fa70a0] http://bugs.debian.org/777556 [perl #123743] simpify and speed up /.*.../ handling
    DEBPKG:fixes/failed_require_diagnostics - http://bugs.debian.org/781120 [perl #123270] Report inaccesible file on failed require
    DEBPKG:fixes/array-cloning - http://bugs.debian.org/779357 [perl #124127] [902d169] fix cloning arrays with unused elements
    DEBPKG:fixes/perldb-threads - http://bugs.debian.org/779357 [perl #124127] [41ef2c6] lib/perl5db.pl: Restore noop lock prototype

---
@INC for perl 5.20.2:
    /etc/perl
    /usr/local/lib/x86_64-linux-gnu/perl/5.20.2
    /usr/local/share/perl/5.20.2
    /usr/lib/x86_64-linux-gnu/perl5/5.20
    /usr/share/perl5
    /usr/lib/x86_64-linux-gnu/perl/5.20
    /usr/share/perl/5.20
    /usr/local/lib/site_perl
    .

---
Environment for perl 5.20.2:
    HOME=/home/fschlich
    LANG=de_DE@euro
    LANGUAGE (unset)
    LC_CTYPE=de_DE@euro
    LC_MESSAGES=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/fschlich/bin:/home/fschlich/bin:/usr/local/sw/i3-jessie/bin:/usr/local/sw/xfce/stable/bin:/usr/local/bin:/usr/local/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/games
    PERL_BADLANG (unset)
    SHELL=/bin/zsh
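
Editorial note on the warning text in the report ("unexpected continuation byte 0xb0, with no preceding start byte"): it follows from how UTF-8 partitions byte values. 0xB0 is the Latin-1 encoding of the degree sign, while the UTF-8 encoding of the same character is the two bytes C2 B0, so a bare 0xB0 can only be the tail of a sequence. Below is a minimal C sketch of that partition, using the byte ranges of RFC 3629. The names `byte_class`, `classify_byte` and `self_check` are made up for illustration and are not part of Perl's source.

```c
#include <assert.h>

/* Hypothetical helper types; nothing here comes from the Perl source. */
typedef enum {
    BYTE_ASCII,         /* 0x00..0x7F: a complete one-byte character      */
    BYTE_CONTINUATION,  /* 0x80..0xBF: only legal after a start byte      */
    BYTE_START_2,       /* 0xC2..0xDF: starts a two-byte sequence         */
    BYTE_START_3,       /* 0xE0..0xEF: starts a three-byte sequence       */
    BYTE_START_4,       /* 0xF0..0xF4: starts a four-byte sequence        */
    BYTE_INVALID        /* 0xC0, 0xC1, 0xF5..0xFF: never valid (RFC 3629) */
} byte_class;

/* Classify a single byte by the role it can play in RFC 3629 UTF-8. */
byte_class classify_byte(unsigned char b)
{
    if (b <= 0x7F) return BYTE_ASCII;
    if (b <= 0xBF) return BYTE_CONTINUATION;
    if (b <= 0xC1) return BYTE_INVALID;
    if (b <= 0xDF) return BYTE_START_2;
    if (b <= 0xEF) return BYTE_START_3;
    if (b <= 0xF4) return BYTE_START_4;
    return BYTE_INVALID;
}

/* Sanity checks tied to the bytes discussed in the report. */
void self_check(void)
{
    assert(classify_byte(0xB0) == BYTE_CONTINUATION); /* the stray byte      */
    assert(classify_byte('C')  == BYTE_ASCII);        /* the byte after it   */
    assert(classify_byte(0xC2) == BYTE_START_2);      /* start of "C2 B0"    */
}
```

Calling `self_check()` exercises the three bytes relevant to the report: the stray 0xB0, the ASCII `C` that follows it, and the 0xC2 start byte that a correctly converted script would contain.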
To: perl5-porters [...] perl.org
From: Karl Williamson <public [...] khwilliamson.com>
Date: Sun, 11 Oct 2015 10:46:41 -0600
Subject: Re: [perl #126310] no "Malformed UTF-8 character" warning on single-quoted strings under "use utf8"
I have taken this ticket, as I'm about to start work on related things.
RT-Send-CC: perl5-porters [...] perl.org
I intend to fix this, unless the consensus is to not. It involves extra work in the parser of doing a UTF-8 validity check when appropriate on single-quoted strings.

-- 
Karl Williamson
RT-Send-CC: perl5-porters [...] perl.org
On Tue Aug 02 19:58:51 2016, khw wrote:
> I intend to fix this, unless the consensus is to not. It involves
> extra work in the parser of doing a UTF-8 validity check when
> appropriate on single-quoted strings.

If you mean in tokeq or scan_str, I think that’s the wrong place to do it. It sounds as though eval "'...'" will be subject to such extra checks as well, but it is perfectly reasonable to assume that perl strings are already well-formed.

Ideally, under ‘use utf8’, the validation would be done when the input is read from a stream, though I can’t say offhand what is the best way to go about that.

-- 
Father Chrysostomos
RT-Send-CC: perl5-porters [...] perl.org
On Tue Aug 02 20:05:11 2016, sprout wrote:
> Ideally, under ‘use utf8’, the validation would be done when the input
> is read from a stream, though I can’t say offhand what is the best way
> to go about that.

Probably in Perl_lex_next_chunk or something it calls.

-- 
Father Chrysostomos
RT-Send-CC: perl5-porters [...] perl.org
On Tue Aug 02 20:09:15 2016, sprout wrote:
> Probably in Perl_lex_next_chunk or something it calls.

Is the attached like what you mean?

-- 
Karl Williamson
Subject: 0001-Proof-of-concept-to-test-input-for-valid-UTF-8.patch
From c1d8cbda01e0b2f372e9341efeb4e306ec0c043d Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Wed, 31 Aug 2016 21:31:28 -0600
Subject: [PATCH] Proof-of-concept to test input for valid UTF-8.

This will fix #126310, and heaven knows what else. I think we should
die at the first malformed UTF-8 encountered in parsing. To try to
continue is asking for trouble, and not going to be DWIM anyway.
---
 toke.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/toke.c b/toke.c
index dbeecd1..eddfb29 100644
--- a/toke.c
+++ b/toke.c
@@ -1339,6 +1339,11 @@ Perl_lex_next_chunk(pTHX_ U32 flags)
     new_bufend_pos = SvCUR(linestr);
     PL_parser->bufend = buf + new_bufend_pos;
     PL_parser->bufptr = buf + bufptr_pos;
+
+    if (UTF && ! is_utf8_string((U8 *) PL_parser->bufptr, PL_parser->bufend - PL_parser->bufptr)) {
+        Perl_croak(aTHX_ "Malformed utf8");
+    }
+
     PL_parser->oldbufptr = buf + oldbufptr_pos;
     PL_parser->oldoldbufptr = buf + oldoldbufptr_pos;
     PL_parser->linestart = buf + linestart_pos;
-- 
2.5.0
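
Editorial note: to make concrete what kind of test the added lines perform, here is a standalone sketch of a buffer-level UTF-8 validity check. It is not Perl's `is_utf8_string`. The function `utf8_is_valid` is a hypothetical stand-in that applies the stricter RFC 3629 rules (no overlong forms, no surrogates, nothing above U+10FFFF), whereas Perl's own function, as the perlapi documentation describes it, accepts Perl's extended UTF-8. The point is only that the check runs over the raw chunk of source text, so the quoting used around a literal makes no difference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the check in the patch (RFC 3629 rules).
 * Returns true if s[0..len-1] is well-formed UTF-8, false otherwise. */
bool utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char b = s[i];
        unsigned char lo = 0x80, hi = 0xBF; /* range allowed for the 2nd byte */
        size_t need;                        /* continuation bytes expected    */
        size_t k;

        if (b <= 0x7F) {                    /* ASCII: one complete character  */
            i++;
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) {
            need = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            need = 2;
            if (b == 0xE0) lo = 0xA0;       /* reject overlong 3-byte forms   */
            if (b == 0xED) hi = 0x9F;       /* reject UTF-16 surrogates       */
        } else if (b >= 0xF0 && b <= 0xF4) {
            need = 3;
            if (b == 0xF0) lo = 0x90;       /* reject overlong 4-byte forms   */
            if (b == 0xF4) hi = 0x8F;       /* reject values above U+10FFFF   */
        } else {
            return false;   /* stray continuation byte or invalid start byte  */
        }

        if (len - i - 1 < need)
            return false;                   /* sequence truncated by buffer end */
        if (s[i + 1] < lo || s[i + 1] > hi)
            return false;
        for (k = 2; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;               /* not a continuation byte        */
        }
        i += need + 1;
    }
    return true;
}
```

Applied to the two inputs from the report, such a check treats them identically: both contain the lone byte 0xB0, which is neither ASCII nor a start byte, so both buffers are rejected before the tokenizer ever looks at the quote characters.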
RT-Send-CC: perl5-porters [...] perl.org
On Wed Aug 31 20:35:02 2016, khw wrote:
> Is the attached like what you mean?

Yes, that would work.

It would be nice, too, if we could add the ‘near such and such’ that yyerror normally does. Maybe yyerror could have an extra option to croak instead of calling qerror. It already has a flags field.

-- 
Father Chrysostomos
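
Editorial note: the "near such and such" idea amounts to reporting where in the input the malformation sits. The sketch below illustrates that under stated assumptions and does not reproduce yyerror's behaviour. `seq_len`, `first_malformed` and `describe_malformation` are hypothetical helpers, the message format is invented, and the check is purely structural (a start byte followed by the right number of continuation bytes), so it would not catch overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Length of the sequence introduced by b, or 0 if b cannot start one.
 * Hypothetical helper; structural check only. */
size_t seq_len(unsigned char b)
{
    if (b <= 0x7F) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;               /* continuation byte or invalid start byte */
}

/* Offset of the start of the first structurally malformed sequence,
 * or len if the whole buffer passes. */
size_t first_malformed(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        size_t n = seq_len(s[i]);
        size_t k;

        if (n == 0 || n > len - i)
            return i;       /* bad start byte, or sequence cut off */
        for (k = 1; k < n; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return i;   /* missing continuation byte */
        }
        i += n;
    }
    return len;
}

/* Write an (invented) diagnostic naming the byte, the 1-based line and the
 * 0-based byte offset. Returns 1 if a malformation was found, else 0. */
int describe_malformation(const unsigned char *s, size_t len,
                          char *out, size_t outsz)
{
    size_t bad = first_malformed(s, len);
    size_t line = 1, i;

    assert(out != NULL && outsz > 0);

    if (bad == len)
        return 0;
    for (i = 0; i < bad; i++) {
        if (s[i] == '\n')
            line++;         /* count newlines before the bad byte */
    }
    snprintf(out, outsz, "Malformed UTF-8 (byte 0x%02x) at line %lu, offset %lu",
             (unsigned)s[bad], (unsigned long)line, (unsigned long)bad);
    return 1;
}
```

A caller would pass the chunk just read plus a message buffer, then croak with the resulting text. Counting newlines up to the offending offset is the simplest way to recover a line number when only a flat buffer is available. A real implementation inside the tokenizer would presumably use the parser's own line bookkeeping instead.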
Subject: [perl #126310] no "Malformed UTF-8 character" warning on single-quoted strings under "use utf8"
From: Florian Schlichting <florian.schlichting [...] fu-berlin.de>
To: RT-Ticket-126310 [...] perl.org, perl5-porters [...] perl.org, khw [...] cpan.org
Date: Fri, 16 Sep 2016 14:46:59 +0200
Hi Karl,

Father Chrysostomos wrote:
> On Wed Aug 31 20:35:02 2016, khw wrote:
>> Is the attached like what you mean?
>
> Yes, that would work.
>
> It would be nice, too, if we could add the `near such and such' that
> yyerror normally does. Maybe yyerror could have an extra option to croak
> instead of calling qerror. It already has a flags field.

thanks for looking into this issue. I tested your patch and can confirm
that it correctly treats single and double quotes the same:

    % ./perl -C0 -le 'print qq(print "\xB0C";)' | ./perl -I'lib' -Mutf8 -CS -l
    Malformed utf8 at - line 1.

    % ./perl -C0 -le 'print qq(print \x27\xB0C\x27;)' | ./perl -I'lib' -Mutf8 -CS -l
    Malformed utf8 at - line 1.

However, I feel a little uneasy about dying altogether. Currently Perl
issues just a warning ("Malformed UTF-8 character") and that seems to be
the approach with UTF-8 issues encountered in other places in toke.c as
well. Most of the time, these will be strings displayed to the user, and
they will mostly still be legible even with a few characters garbled or
skipped. Don't you think "complain and carry on" is what users would
expect?

Florian
Date: Fri, 16 Sep 2016 14:34:29 -0600
To: Florian Schlichting <florian.schlichting [...] fu-berlin.de>, RT-Ticket-126310 [...] perl.org, perl5-porters [...] perl.org
Subject: Re: [perl #126310] no "Malformed UTF-8 character" warning on single-quoted strings under "use utf8"
From: Karl Williamson <khw [...] cpan.org>
On 09/16/2016 06:46 AM, Florian Schlichting wrote:
> However, I feel a little uneasy about dying altogether. Currently Perl
> issues just a warning ("Malformed UTF-8 character") and that seems to be
> the approach with UTF-8 issues encountered in other places in toke.c as
> well. Most of the time, these will be strings displayed to the user, and
> they will mostly still be legible even with a few characters garbled or
> skipped. Don't you think "complain and carry on" is what users would
> expect?

But we are running into segfaults because of trying to keep going in the face of malformed UTF-8. I'm thinking the lesson should be to give up when we find it, and this is a reasonable place to start. There are places where malformed UTF-8 is fatal.
RT-Send-CC: perl5-porters [...] perl.org
On Fri Sep 16 13:34:55 2016, khw wrote:
> But we are running into segfaults because of trying to keep going in
> the face of malformed UTF-8. I'm thinking the lesson should be to give
> up when we find it, and this is a reasonable place to start. There are
> places where malformed UTF-8 is fatal.

I agree. If perl keeps going, then even if it does not crash, it will die on those malformed strings later.

-- 
Father Chrysostomos
Date: Thu, 13 Oct 2016 12:49:01 -0600
From: Karl Williamson <public [...] khwilliamson.com>
CC: perl5-porters [...] perl.org
To: perlbug-followup [...] perl.org
Subject: Re: [perl #126310] no "Malformed UTF-8 character" warning on single-quoted strings under "use utf8"
On 09/16/2016 04:44 PM, Father Chrysostomos via RT wrote:
> I agree. If perl keeps going, then even if it does not crash, it will
> die on those malformed strings later.

blead now has improved diagnostics for when malformations occur. I am thinking that these should be turned on unconditionally when this error occurs, as we are going to immediately die anyway.

Any opposition?
RT-Send-CC: perl5-porters [...] perl.org
This has been fixed in blead by 6cdc5cd8f36f88172b0fcefdcadec75f5b6600b2

-- 
Karl Williamson
Thank you for filing this report. You have helped make Perl better.

With the release today of Perl 5.26.0, this and 210 other issues have been resolved.

Perl 5.26.0 may be downloaded via:
https://metacpan.org/release/XSAWYERX/perl-5.26.0

If you find that the problem persists, feel free to reopen this ticket.

