Skip Menu |
Report information
Id: 131190
Status: open
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: saint.snit [at] gmail.com
Cc:
AdminCc:

Operating System: (no value)
PatchStatus: (no value)
Severity: low
Type: unknown
Perl Version: (no value)
Fixed In: (no value)



Date: Fri, 21 Apr 2017 18:13:51 -0500
Subject: erroneous regex warning after utf8 conversion
From: saint.snit [...] gmail.com
To: perlbug [...] perl.org
Download (untitled) / with headers
text/plain 6.8k
This is a bug report for perl from saint.snit@gmail.com, generated with the help of perlbug 1.40 running under perl 5.22.2. ----------------------------------------------------------------- Perl emits an inapplicable warning for some regular expressions. Here is a code block that demonstrates the bug: use experimental 'smartmatch'; use POSIX 'locale_h'; use locale ':ctype'; setlocale(LC_CTYPE, 'en_US'); $_ = "x"; utf8::upgrade($_); /x(y|z)?/; This produces the output: Wide character (U+FFFD) in pattern match (m//) at - line 8. even though no such character is ever specified. The bug only seems to occur with the particular combination of "use" statements above; for instance, it does not occur if the "smartmatch" line is omitted, even though the code uses no smart matching. (For reference, discussion about this bug originated at http://perlmonks.com/?node_id=1188156) Thank you for your time. ----------------------------------------------------------------- --- Flags: category=core severity=low --- Site configuration information for perl 5.22.2: Configured by Gentoo at Thu Oct 20 22:32:43 CDT 2016. Summary of my perl5 (revision 5 version 22 subversion 2) configuration: Platform: osname=linux, osvers=4.4.21-gentoo-evo, archname=i686-linux uname='linux evo 4.4.21-gentoo-evo #1 smp sun oct 9 21:43:27 cdt 2016 i686 intel(r) pentium(r) 4 cpu 1.70ghz genuineintel gnulinux ' config_args='-des -Duseshrplib -Darchname=i686-linux -Dcc=i686-pc-linux-gnu-gcc -Doptimize=-O2 -march=native -pipe -Dldflags=-Wl,-O1 -Wl,--as-needed -Dprefix=/usr -Dinstallprefix=/usr -Dsiteprefix=/usr/local -Dvendorprefix=/usr -Dscriptdir=/usr/bin -Dprivlib=/usr/lib/perl5/5.22.2 -Darchlib=/usr/lib/perl5/5.22.2/i686-linux -Dsitelib=/usr/local/lib/perl5/5.22.2 -Dsitearch=/usr/local/lib/perl5/5.22.2/i686-linux -Dvendorlib=/usr/lib/perl5/vendor_perl/5.22.2 -Dvendorarch=/usr/lib/perl5/vendor_perl/5.22.2/i686-linux -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3 -Dman1ext=1 -Dman3ext=3pm -Dlibperl=libperl.so.5.22.2 -Dlocincpth=/usr/include -Dglibpth=/lib /usr/lib -Duselargefiles -Dd_semctl_semun -Dcf_by=Gentoo -Dmyhostname=localhost -Dperladmin=root@localhost -Dinstallusrbinperl=n -Ud_csh -Uusenm -Di_ndbm -Di_gdbm -Di_db -DDEBUGGING=none -Dinc_version_list=5.22.0/i686-linux 5.22.0 5.22.1/i686-linux 5.22.1 -Dnoextensions=ODBM_File' hint=recommended, useposix=true, d_sigaction=define useithreads=undef, usemultiplicity=undef use64bitint=undef, use64bitall=undef, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='i686-pc-linux-gnu-gcc', ccflags ='-fwrapv -fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O2 -march=native -pipe', cppflags='-fwrapv -fno-strict-aliasing -pipe' ccversion='', gccversion='4.9.3', gccosandvers='' intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234, doublekind=3 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12, longdblkind=3 ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=4, prototype=define Linker and Libraries: ld='i686-pc-linux-gnu-gcc', ldflags ='-Wl,-O1 -Wl,--as-needed' libpth=/usr/lib/gcc/i686-pc-linux-gnu/4.9.3/include-fixed /usr/lib /lib/../lib /usr/lib/../lib /lib libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc libc=libc-2.22.so, so=so, useshrplib=true, libperl=libperl.so.5.22.2 gnulibc_version='2.22' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E' cccdlflags='-fPIC', lddlflags='-shared -O2 -march=native -pipe -Wl,-O1 -Wl,--as-needed' Locally applied patches: gentoo/hints_hpux - Fix hpux hints gentoo/aix_soname - aix gcc detection and shared library soname support gentoo/EUMM-RUNPATH - https://bugs.gentoo.org/105054 cpan/ExtUtils-MakeMaker: drop $PORTAGE_TMPDIR from LD_RUN_PATH gentoo/config_over - Remove -rpath and append LDFLAGS to lddlflags gentoo/opensolaris_headers - [PATCH] Add headers for opensolaris gentoo/patchlevel - List packaged patches for perl-5.22.2(#1) in patchlevel.h gentoo/cpanplus_definstalldirs - Configure CPANPLUS to use the site directories by default. gentoo/cleanup-paths - [PATCH] Cleanup PATH and shrpenv gentoo/enc2xs - Tweak enc2xs to follow symlinks and ignore missing @INC directories. gentoo/darwin-cc-ld - https://bugs.gentoo.org/297751 [PATCH] darwin: Use $CC to link gentoo/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN. gentoo/interix - [PATCH] Fix interix hints gentoo/create_libperl_soname - https://bugs.gentoo.org/286840 [PATCH] Set libperl soname gentoo/mod_paths - Add /etc/perl to @INC gentoo/EUMM_perllocalpod - gentoo/drop_fstack_protector - https://bugs.gentoo.org/348557 [PATCH] Don't force -fstack-protector on everyone gentoo/usr_local - [PATCH] Remove /usr/local paths gentoo/D-SHA-CFLAGS - https://bugs.gentoo.org/506818 [PATCH] [PATCH] Do not set custom CFLAGS in cpan/Digest-SHA gentoo/io_socket_ip_tests - gentoo/tests.patch - debian/cpan-missing-site-dirs - Fix CPAN::FirstTime defaults with nonexisting site dirs if a parent is writable debian/makemaker-pasthru - Pass LD settings through to subdirectories debian/locale-robustness - [perl #124310] Make t/run/locale.t survive missing locales masked by LC_ALL fixes/memoize_storable_nstore - [rt.cpan.org #77790] Memoize::Storable: respect 'nstore' option not respected fixes/podman-pipe - Better errors for man pages from standard input fixes/respect_umask - Respect umask during installation fixes/podman-utc - Make the embedded date from Pod::Man reproducible fixes/podman-utc-docs - Documentation and test suite updates for UTC fix fixes/net_smtp_docs - [rt.cpan.org #36038] Document the Net::SMTP 'Port' option fixes/document_makemaker_ccflags - [rt.cpan.org #68613] Document that CCFLAGS should include $Config{ccflags} --- @INC for perl 5.22.2: /etc/perl /usr/local/lib/perl5/5.22.2/i686-linux /usr/local/lib/perl5/5.22.2 /usr/lib/perl5/vendor_perl/5.22.2/i686-linux /usr/lib/perl5/vendor_perl/5.22.2 /usr/local/lib/perl5 /usr/lib/perl5/vendor_perl /usr/lib/perl5/5.22.2/i686-linux /usr/lib/perl5/5.22.2 . --- Environment for perl 5.22.2: HOME=/home/vax LANG=C LANGUAGE (unset) LC_CTYPE=en_US.iso88591 LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/vax/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/i686-pc-linux-gnu/gcc-bin/4.9.3:/sbin:/usr/sbin:/usr/games/bin:./bin PERL_BADLANG (unset) SHELL=/bin/bash
Date: Sun, 23 Apr 2017 20:33:12 +0100
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
To: perl5-porters [...] perl.org
From: Zefram <zefram [...] fysh.org>
Bisecting shows that the warning started appearing for that test script at v5.21.7-165-g613abc6 "Raise warning on multi-byte char in single-byte locale". Attempting to minimise the test script, it turns out that the "use experimental" line is not required for any reason relating to smartmatch, but simply for its effect on lexical warning flags. Anything touching lexical warnings will do, such as the simpler "use warnings". And thus enabling all warnings produces an additional warning that sheds some light on the matter: $ perl ../rt131190 Malformed UTF-8 character (empty string) in pattern match (m//) at ../rt131190 line 8. Wide character (U+FFFD) in pattern match (m//) at ../rt131190 line 8. Looks like the problem is that the check for wide characters should be passing in the UTF8_ALLOW_EMPTY flag. Without this, when it's at end of string it perceives a malformed character, for which it warns about malformation and substitutes in a replacement character, which is wide and therefore triggers the wide character warning. -zefram
Date: Sun, 23 Apr 2017 22:19:41 -0600
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
To: Zefram <zefram [...] fysh.org>, perl5-porters [...] perl.org
From: Karl Williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 1.2k
On 04/23/2017 01:33 PM, Zefram wrote: Show quoted text
> Bisecting shows that the warning started appearing for that test script > at v5.21.7-165-g613abc6 "Raise warning on multi-byte char in single-byte > locale". > > Attempting to minimise the test script, it turns out that the "use > experimental" line is not required for any reason relating to smartmatch, > but simply for its effect on lexical warning flags. Anything touching > lexical warnings will do, such as the simpler "use warnings". And thus > enabling all warnings produces an additional warning that sheds some > light on the matter: > > $ perl ../rt131190 > Malformed UTF-8 character (empty string) in pattern match (m//) at ../rt131190 line 8. > Wide character (U+FFFD) in pattern match (m//) at ../rt131190 line 8. > > Looks like the problem is that the check for wide characters should be > passing in the UTF8_ALLOW_EMPTY flag. Without this, when it's at end > of string it perceives a malformed character, for which it warns about > malformation and substitutes in a replacement character, which is wide > and therefore triggers the wide character warning. >
Actually, the decode function shouldn't be getting called at all if there is nothing to decode. Show quoted text
> -zefram >
To: Karl Williamson <public [...] khwilliamson.com>
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
From: demerphq <demerphq [...] gmail.com>
CC: Zefram <zefram [...] fysh.org>, Perl5 Porteros <perl5-porters [...] perl.org>
Date: Mon, 24 Apr 2017 09:20:41 +0200
Download (untitled) / with headers
text/plain 2.8k
On 24 April 2017 at 06:19, Karl Williamson <public@khwilliamson.com> wrote: Show quoted text
> On 04/23/2017 01:33 PM, Zefram wrote:
>> >> Bisecting shows that the warning started appearing for that test script >> at v5.21.7-165-g613abc6 "Raise warning on multi-byte char in single-byte >> locale". >> >> Attempting to minimise the test script, it turns out that the "use >> experimental" line is not required for any reason relating to smartmatch, >> but simply for its effect on lexical warning flags. Anything touching >> lexical warnings will do, such as the simpler "use warnings". And thus >> enabling all warnings produces an additional warning that sheds some >> light on the matter: >> >> $ perl ../rt131190 >> Malformed UTF-8 character (empty string) in pattern match (m//) at >> ../rt131190 line 8. >> Wide character (U+FFFD) in pattern match (m//) at ../rt131190 line 8. >> >> Looks like the problem is that the check for wide characters should be >> passing in the UTF8_ALLOW_EMPTY flag. Without this, when it's at end >> of string it perceives a malformed character, for which it warns about >> malformation and substitutes in a replacement character, which is wide >> and therefore triggers the wide character warning. >>
> > Actually, the decode function shouldn't be getting called at all if there is > nothing to decode.
I pushed a fix for this yesterday. Dave M changed the var "nextchr" so that it supported a negative argument beyond its normal duties as holding the codepoint of the nextchar in the regex matching. This negative argument, -10, is used to represent the end of string. (All very reasonable.) Unfortunately this doesn't play entirely nicely with the utf8 code which for other generally good reasons tends to cast its arguments to U8 (to deal with "char" inputs). So the code: if (utf8_target && UTF8_IS_ABOVE_LATIN1(nextchr) sees the -10, casts it to a U8, producing 255, and then considers it to be the first octet of a (malformed presumably) utf8 sequence. The fix was to change this to if (utf8_target && nextchr >= 0 && UTF8_IS_ABOVE_LATIN1(nextchr) I was going to push this to a branch etc, but I managed to make a big mess of things yesterday and pushed it to trunk along with a bunch of other junk stuff which was meant for after the code freeze. After running out of time cleaning up the stuff that definitely shouldnt be part of this next release, and being somewhat unwilling to allow a perl to be released with such an obvious bug and obviously safe bugfix unapplied I decided to leave it for today to sort out. So this issues is fixed and understood, but the exact status of the patch is a bit up in the air. If sawyer says to revert it then I will, but I feel this patch is safe enough to be part of the release. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
Date: Wed, 26 Apr 2017 10:16:40 -0600
CC: Zefram <zefram [...] fysh.org>, Perl5 Porteros <perl5-porters [...] perl.org>
From: Karl Williamson <public [...] khwilliamson.com>
To: demerphq <demerphq [...] gmail.com>
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
On 04/24/2017 01:20 AM, demerphq wrote: Show quoted text
> On 24 April 2017 at 06:19, Karl Williamson <public@khwilliamson.com> wrote:
>> On 04/23/2017 01:33 PM, Zefram wrote:
>>> >>> Bisecting shows that the warning started appearing for that test script >>> at v5.21.7-165-g613abc6 "Raise warning on multi-byte char in single-byte >>> locale". >>> >>> Attempting to minimise the test script, it turns out that the "use >>> experimental" line is not required for any reason relating to smartmatch, >>> but simply for its effect on lexical warning flags. Anything touching >>> lexical warnings will do, such as the simpler "use warnings". And thus >>> enabling all warnings produces an additional warning that sheds some >>> light on the matter: >>> >>> $ perl ../rt131190 >>> Malformed UTF-8 character (empty string) in pattern match (m//) at >>> ../rt131190 line 8. >>> Wide character (U+FFFD) in pattern match (m//) at ../rt131190 line 8. >>> >>> Looks like the problem is that the check for wide characters should be >>> passing in the UTF8_ALLOW_EMPTY flag. Without this, when it's at end >>> of string it perceives a malformed character, for which it warns about >>> malformation and substitutes in a replacement character, which is wide >>> and therefore triggers the wide character warning. >>>
>> >> Actually, the decode function shouldn't be getting called at all if there is >> nothing to decode.
> > I pushed a fix for this yesterday. > > Dave M changed the var "nextchr" so that it supported a negative > argument beyond its normal duties as holding the codepoint of the > nextchar in the regex matching. This negative argument, -10, is used > to represent the end of string. (All very reasonable.) > > Unfortunately this doesn't play entirely nicely with the utf8 code > which for other generally good reasons tends to cast its arguments to > U8 (to deal with "char" inputs). So the code: > > if (utf8_target > && UTF8_IS_ABOVE_LATIN1(nextchr) > > sees the -10, casts it to a U8, producing 255, and then considers it > to be the first octet of a (malformed presumably) utf8 sequence. > > The fix was to change this to > > > if (utf8_target > && nextchr >= 0 > && UTF8_IS_ABOVE_LATIN1(nextchr) > > I was going to push this to a branch etc, but I managed to make a big > mess of things yesterday and pushed it to trunk along with a bunch of > other junk stuff which was meant for after the code freeze. > > After running out of time cleaning up the stuff that definitely > shouldnt be part of this next release, and being somewhat unwilling to > allow a perl to be released with such an obvious bug and obviously > safe bugfix unapplied I decided to leave it for today to sort out. > > So this issues is fixed and understood, but the exact status of the > patch is a bit up in the air. If sawyer says to revert it then I will, > but I feel this patch is safe enough to be part of the release. > > Yves >
Note that no test for this problem has been committed.
Date: Wed, 26 Apr 2017 18:52:20 +0200
From: demerphq <demerphq [...] gmail.com>
CC: Perl5 Porteros <perl5-porters [...] perl.org>, Zefram <zefram [...] fysh.org>
To: karl williamson <public [...] khwilliamson.com>
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
Download (untitled) / with headers
text/plain 3.3k
No. Because we are debating if it should be reverted due to code freeze. I personally think this is sufficiently isolated we should keep it. I am waiting on sawyer to make a call... 

Yves

On 26 Apr 2017 6:17 p.m., "Karl Williamson" <public@khwilliamson.com> wrote:
Show quoted text
On 04/24/2017 01:20 AM, demerphq wrote:
On 24 April 2017 at 06:19, Karl Williamson <public@khwilliamson.com> wrote:
On 04/23/2017 01:33 PM, Zefram wrote:

Bisecting shows that the warning started appearing for that test script
at v5.21.7-165-g613abc6 "Raise warning on multi-byte char in single-byte
locale".

Attempting to minimise the test script, it turns out that the "use
experimental" line is not required for any reason relating to smartmatch,
but simply for its effect on lexical warning flags.  Anything touching
lexical warnings will do, such as the simpler "use warnings".  And thus
enabling all warnings produces an additional warning that sheds some
light on the matter:

$ perl ../rt131190
Malformed UTF-8 character (empty string) in pattern match (m//) at
../rt131190 line 8.
Wide character (U+FFFD) in pattern match (m//) at ../rt131190 line 8.

Looks like the problem is that the check for wide characters should be
passing in the UTF8_ALLOW_EMPTY flag.  Without this, when it's at end
of string it perceives a malformed character, for which it warns about
malformation and substitutes in a replacement character, which is wide
and therefore triggers the wide character warning.


Actually, the decode function shouldn't be getting called at all if there is
nothing to decode.

I pushed a fix for this yesterday.

Dave M changed the var "nextchr" so that it supported a negative
argument beyond its normal duties as holding the codepoint of the
nextchar in the regex matching. This negative argument, -10, is used
to represent the end of string. (All very reasonable.)

Unfortunately this doesn't play entirely nicely with the utf8 code
which for other generally good reasons tends to cast its arguments to
U8 (to deal with "char" inputs). So the code:

                     if (utf8_target
                        && UTF8_IS_ABOVE_LATIN1(nextchr)

sees the -10, casts it to a U8, producing 255, and then considers it
to be the first octet of a (malformed presumably) utf8 sequence.

The fix was to change this to


                     if (utf8_target
                        && nextchr >= 0
                        && UTF8_IS_ABOVE_LATIN1(nextchr)

I was going to push this to a branch etc, but I managed to make a big
mess of things yesterday and pushed it to trunk along with a bunch of
other junk stuff which was meant for after the code freeze.

After running out of time cleaning up the stuff that definitely
shouldnt be part of this next release, and being somewhat unwilling to
allow a perl to be released with such an obvious bug and obviously
safe bugfix unapplied I decided to leave it for today to sort out.

So this issues is fixed and understood, but the exact status of the
patch is a bit up in the air. If sawyer says to revert it then I will,
but I feel this patch is safe enough to be part of the release.

Yves


Note that no test for this problem has been committed.
Date: Wed, 26 Apr 2017 11:15:25 -0600
CC: Perl5 Porteros <perl5-porters [...] perl.org>, Zefram <zefram [...] fysh.org>
From: Karl Williamson <public [...] khwilliamson.com>
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
To: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 4.2k
On 04/26/2017 10:52 AM, demerphq wrote: Show quoted text
> No. Because we are debating if it should be reverted due to code freeze. > I personally think this is sufficiently isolated we should keep it. I am > waiting on sawyer to make a call... > > Yves
Right, but if we keep it, shouldn't we also push a test? Note that this is a regression introduced in 5.25. Show quoted text
> > On 26 Apr 2017 6:17 p.m., "Karl Williamson" <public@khwilliamson.com > <mailto:public@khwilliamson.com>> wrote: > > On 04/24/2017 01:20 AM, demerphq wrote: > > On 24 April 2017 at 06:19, Karl Williamson > <public@khwilliamson.com <mailto:public@khwilliamson.com>> wrote: > > On 04/23/2017 01:33 PM, Zefram wrote: > > > Bisecting shows that the warning started appearing for > that test script > at v5.21.7-165-g613abc6 "Raise warning on multi-byte > char in single-byte > locale". > > Attempting to minimise the test script, it turns out > that the "use > experimental" line is not required for any reason > relating to smartmatch, > but simply for its effect on lexical warning flags. > Anything touching > lexical warnings will do, such as the simpler "use > warnings". And thus > enabling all warnings produces an additional warning > that sheds some > light on the matter: > > $ perl ../rt131190 > Malformed UTF-8 character (empty string) in pattern > match (m//) at > ../rt131190 line 8. > Wide character (U+FFFD) in pattern match (m//) at > ../rt131190 line 8. > > Looks like the problem is that the check for wide > characters should be > passing in the UTF8_ALLOW_EMPTY flag. Without this, > when it's at end > of string it perceives a malformed character, for which > it warns about > malformation and substitutes in a replacement character, > which is wide > and therefore triggers the wide character warning. > > > Actually, the decode function shouldn't be getting called at > all if there is > nothing to decode. > > > I pushed a fix for this yesterday. > > Dave M changed the var "nextchr" so that it supported a negative > argument beyond its normal duties as holding the codepoint of the > nextchar in the regex matching. This negative argument, -10, is used > to represent the end of string. (All very reasonable.) > > Unfortunately this doesn't play entirely nicely with the utf8 code > which for other generally good reasons tends to cast its > arguments to > U8 (to deal with "char" inputs). So the code: > > if (utf8_target > && UTF8_IS_ABOVE_LATIN1(nextchr) > > sees the -10, casts it to a U8, producing 255, and then considers it > to be the first octet of a (malformed presumably) utf8 sequence. > > The fix was to change this to > > > if (utf8_target > && nextchr >= 0 > && UTF8_IS_ABOVE_LATIN1(nextchr) > > I was going to push this to a branch etc, but I managed to make > a big > mess of things yesterday and pushed it to trunk along with a > bunch of > other junk stuff which was meant for after the code freeze. > > After running out of time cleaning up the stuff that definitely > shouldnt be part of this next release, and being somewhat > unwilling to > allow a perl to be released with such an obvious bug and obviously > safe bugfix unapplied I decided to leave it for today to sort out. > > So this issues is fixed and understood, but the exact status of the > patch is a bit up in the air. If sawyer says to revert it then I > will, > but I feel this patch is safe enough to be part of the release. > > Yves > > > Note that no test for this problem has been committed. >
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
To: karl williamson <public [...] khwilliamson.com>
Date: Wed, 26 Apr 2017 23:37:43 +0200
CC: Perl5 Porteros <perl5-porters [...] perl.org>, Zefram <zefram [...] fysh.org>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 742b


On 26 Apr 2017 7:21 p.m., "Karl Williamson" <public@khwilliamson.com> wrote:
Show quoted text
On 04/26/2017 10:52 AM, demerphq wrote:
No. Because we are debating if it should be reverted due to code freeze.
I personally think this is sufficiently isolated we should keep it. I am
waiting on sawyer to make a call...

Yves

Right, but if we keep it, shouldn't we also push a test?

Note that this is a regression introduced in 5.25

I already messed up the change log with a bunch of reverts. Id rather not add another patch if it's just going to be reverted. So if sawyer approves this change despite the code freeze then we can add a test, otherwise let's have less to revert. 

But one way or another we will need a test for sure. ☺️

Yves
Show quoted text
 
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
To: demerphq <demerphq [...] gmail.com>, karl williamson <public [...] khwilliamson.com>
CC: Perl5 Porteros <perl5-porters [...] perl.org>, Zefram <zefram [...] fysh.org>
From: Sawyer X <xsawyerx [...] gmail.com>
Date: Thu, 27 Apr 2017 13:13:09 +0200
Download (untitled) / with headers
text/plain 4.4k
Just for the official record (and relating to James' email), I'm still looking into whether to keep it or not. (I'd rather discuss it with Yves instead of making a single-sided decision.) Everyone, feel free to provide your opinion. Just keep it polite. :) On 04/26/2017 06:52 PM, demerphq wrote: Show quoted text
> No. Because we are debating if it should be reverted due to code > freeze. I personally think this is sufficiently isolated we should > keep it. I am waiting on sawyer to make a call... > > Yves > > On 26 Apr 2017 6:17 p.m., "Karl Williamson" <public@khwilliamson.com > <mailto:public@khwilliamson.com>> wrote: > > On 04/24/2017 01:20 AM, demerphq wrote: > > On 24 April 2017 at 06:19, Karl Williamson > <public@khwilliamson.com <mailto:public@khwilliamson.com>> wrote: > > On 04/23/2017 01:33 PM, Zefram wrote: > > > Bisecting shows that the warning started appearing for > that test script > at v5.21.7-165-g613abc6 "Raise warning on multi-byte > char in single-byte > locale". > > Attempting to minimise the test script, it turns out > that the "use > experimental" line is not required for any reason > relating to smartmatch, > but simply for its effect on lexical warning flags. > Anything touching > lexical warnings will do, such as the simpler "use > warnings". And thus > enabling all warnings produces an additional warning > that sheds some > light on the matter: > > $ perl ../rt131190 > Malformed UTF-8 character (empty string) in pattern > match (m//) at > ../rt131190 line 8. > Wide character (U+FFFD) in pattern match (m//) at > ../rt131190 line 8. > > Looks like the problem is that the check for wide > characters should be > passing in the UTF8_ALLOW_EMPTY flag. Without this, > when it's at end > of string it perceives a malformed character, for > which it warns about > malformation and substitutes in a replacement > character, which is wide > and therefore triggers the wide character warning. > > > Actually, the decode function shouldn't be getting called > at all if there is > nothing to decode. > > > I pushed a fix for this yesterday. > > Dave M changed the var "nextchr" so that it supported a negative > argument beyond its normal duties as holding the codepoint of the > nextchar in the regex matching. This negative argument, -10, > is used > to represent the end of string. (All very reasonable.) > > Unfortunately this doesn't play entirely nicely with the utf8 code > which for other generally good reasons tends to cast its > arguments to > U8 (to deal with "char" inputs). So the code: > > if (utf8_target > && UTF8_IS_ABOVE_LATIN1(nextchr) > > sees the -10, casts it to a U8, producing 255, and then > considers it > to be the first octet of a (malformed presumably) utf8 sequence. > > The fix was to change this to > > > if (utf8_target > && nextchr >= 0 > && UTF8_IS_ABOVE_LATIN1(nextchr) > > I was going to push this to a branch etc, but I managed to > make a big > mess of things yesterday and pushed it to trunk along with a > bunch of > other junk stuff which was meant for after the code freeze. > > After running out of time cleaning up the stuff that definitely > shouldnt be part of this next release, and being somewhat > unwilling to > allow a perl to be released with such an obvious bug and obviously > safe bugfix unapplied I decided to leave it for today to sort out. > > So this issues is fixed and understood, but the exact status > of the > patch is a bit up in the air. If sawyer says to revert it then > I will, > but I feel this patch is safe enough to be part of the release. > > Yves > > > Note that no test for this problem has been committed. >
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
To: Sawyer X <xsawyerx [...] gmail.com>, demerphq <demerphq [...] gmail.com>
CC: Perl5 Porteros <perl5-porters [...] perl.org>, Zefram <zefram [...] fysh.org>
From: Karl Williamson <public [...] khwilliamson.com>
Date: Thu, 27 Apr 2017 11:35:53 -0600
Download (untitled) / with headers
text/plain 5.2k
On 04/27/2017 05:13 AM, Sawyer X wrote: Show quoted text
> Just for the official record (and relating to James' email), I'm still > looking into whether to keep it or not. > > (I'd rather discuss it with Yves instead of making a single-sided decision.) > > Everyone, feel free to provide your opinion. Just keep it polite. :)
I think the patch should stay. It is extremely low risk, and fixes a regression and hence would be eligible for 5.26.1 anyway. But, I want to point out that this highlights a flaw in the perl implementation. Here, a negative number is being used as a sentinel. But in most of the implementation, code points can be any UV. If a user has just the right code point point it would be treated as negative and conflated with this sentinel. There are other situations where negatives are similarly used. Because of this, code points outside the range 0..IV_MAX are deprecated, and are scheduled to be forbidden in 5.28. Show quoted text
> > > On 04/26/2017 06:52 PM, demerphq wrote:
>> No. Because we are debating if it should be reverted due to code >> freeze. I personally think this is sufficiently isolated we should >> keep it. I am waiting on sawyer to make a call... >> >> Yves >> >> On 26 Apr 2017 6:17 p.m., "Karl Williamson" <public@khwilliamson.com >> <mailto:public@khwilliamson.com>> wrote: >> >> On 04/24/2017 01:20 AM, demerphq wrote: >> >> On 24 April 2017 at 06:19, Karl Williamson >> <public@khwilliamson.com <mailto:public@khwilliamson.com>> wrote: >> >> On 04/23/2017 01:33 PM, Zefram wrote: >> >> >> Bisecting shows that the warning started appearing for >> that test script >> at v5.21.7-165-g613abc6 "Raise warning on multi-byte >> char in single-byte >> locale". >> >> Attempting to minimise the test script, it turns out >> that the "use >> experimental" line is not required for any reason >> relating to smartmatch, >> but simply for its effect on lexical warning flags. >> Anything touching >> lexical warnings will do, such as the simpler "use >> warnings". And thus >> enabling all warnings produces an additional warning >> that sheds some >> light on the matter: >> >> $ perl ../rt131190 >> Malformed UTF-8 character (empty string) in pattern >> match (m//) at >> ../rt131190 line 8. >> Wide character (U+FFFD) in pattern match (m//) at >> ../rt131190 line 8. >> >> Looks like the problem is that the check for wide >> characters should be >> passing in the UTF8_ALLOW_EMPTY flag. Without this, >> when it's at end >> of string it perceives a malformed character, for >> which it warns about >> malformation and substitutes in a replacement >> character, which is wide >> and therefore triggers the wide character warning. >> >> >> Actually, the decode function shouldn't be getting called >> at all if there is >> nothing to decode. >> >> >> I pushed a fix for this yesterday. >> >> Dave M changed the var "nextchr" so that it supported a negative >> argument beyond its normal duties as holding the codepoint of the >> nextchar in the regex matching. This negative argument, -10, >> is used >> to represent the end of string. (All very reasonable.) >> >> Unfortunately this doesn't play entirely nicely with the utf8 code >> which for other generally good reasons tends to cast its >> arguments to >> U8 (to deal with "char" inputs). So the code: >> >> if (utf8_target >> && UTF8_IS_ABOVE_LATIN1(nextchr) >> >> sees the -10, casts it to a U8, producing 255, and then >> considers it >> to be the first octet of a (malformed presumably) utf8 sequence. >> >> The fix was to change this to >> >> >> if (utf8_target >> && nextchr >= 0 >> && UTF8_IS_ABOVE_LATIN1(nextchr) >> >> I was going to push this to a branch etc, but I managed to >> make a big >> mess of things yesterday and pushed it to trunk along with a >> bunch of >> other junk stuff which was meant for after the code freeze. >> >> After running out of time cleaning up the stuff that definitely >> shouldnt be part of this next release, and being somewhat >> unwilling to >> allow a perl to be released with such an obvious bug and obviously >> safe bugfix unapplied I decided to leave it for today to sort out. >> >> So this issues is fixed and understood, but the exact status >> of the >> patch is a bit up in the air. If sawyer says to revert it then >> I will, >> but I feel this patch is safe enough to be part of the release. >> >> Yves >> >> >> Note that no test for this problem has been committed. >>
> >
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
To: Karl Williamson <public [...] khwilliamson.com>
From: demerphq <demerphq [...] gmail.com>
CC: Sawyer X <xsawyerx [...] gmail.com>, Perl5 Porteros <perl5-porters [...] perl.org>, Zefram <zefram [...] fysh.org>
Date: Thu, 27 Apr 2017 20:32:52 +0200
On 27 April 2017 at 19:35, Karl Williamson <public@khwilliamson.com> wrote: Show quoted text
> On 04/27/2017 05:13 AM, Sawyer X wrote:
>> >> Just for the official record (and relating to James' email), I'm still >> looking into whether to keep it or not. >> >> (I'd rather discuss it with Yves instead of making a single-sided >> decision.) >> >> Everyone, feel free to provide your opinion. Just keep it polite. :)
> > > I think the patch should stay. It is extremely low risk, and fixes a > regression and hence would be eligible for 5.26.1 anyway.
Summarizes my view too. We regressed in 5.25.x on this, and despite the fact we only noticed during the code freeze give the fix is as safe as a fix can be I see no point in rolling out a known regression in 5.26.x. If there were any doubt that this patch could have a secondary side-effect beside fixing this bug I would agree it should wait, and again, if this had be present in 5.24 I would say it could wait. But as is rolling out 5.26.x with this patch reverted means we are knowingly rolling out 5.26.x with localized regexes broken. Given the extremely low risk this seems extremely irresponsible. Show quoted text
> But, I want to point out that this highlights a flaw in the perl > implementation. Here, a negative number is being used as a sentinel. But in > most of the implementation, code points can be any UV. If a user has just > the right code point point it would be treated as negative and conflated > with this sentinel. There are other situations where negatives are > similarly used. Because of this, code points outside the range 0..IV_MAX > are deprecated, and are scheduled to be forbidden in 5.28.
I think I mispoke when I said that nexchr holds a codepoint, I think I should have said it holds the next *octet*, or in other words my understanding is nexchr is either negative, in which case it is -10, otherwise it is restricted to 0..255. If I am correct in this then I believe the concern you raise here is not a problem, where it would be if it held a code-point. One of us should double check, or maybe Dave can speakup on this. cheers, Yves
Date: Thu, 27 Apr 2017 14:42:22 -0500
From: saint.snit [...] gmail.com
To: perlbug-followup [...] perl.org
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
Download (untitled) / with headers
text/plain 337b
Show quoted text
> If there were any doubt that this patch could have a secondary > side-effect beside fixing this bug I would agree it should wait, and > again, if this had be present in 5.24 I would say it could wait.
Not to advocate for or against the patch's inclusion, but the bug has existed since at least 5.22.2, which is where I discovered it.
From: Karl Williamson <public [...] khwilliamson.com>
CC: Sawyer X <xsawyerx [...] gmail.com>, Perl5 Porteros <perl5-porters [...] perl.org>, Zefram <zefram [...] fysh.org>
Date: Thu, 27 Apr 2017 13:45:46 -0600
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
To: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 1.5k
On 04/27/2017 12:32 PM, demerphq wrote: Show quoted text
>> But, I want to point out that this highlights a flaw in the perl >> implementation. Here, a negative number is being used as a sentinel. But in >> most of the implementation, code points can be any UV. If a user has just >> the right code point point it would be treated as negative and conflated >> with this sentinel. There are other situations where negatives are >> similarly used. Because of this, code points outside the range 0..IV_MAX >> are deprecated, and are scheduled to be forbidden in 5.28.
> I think I mispoke when I said that nexchr holds a codepoint, I think I > should have said it holds the next *octet*, or in other words my > understanding is nexchr is either negative, in which case it is -10, > otherwise it is restricted to 0..255. > > If I am correct in this then I believe the concern you raise here is > not a problem, where it would be if it held a code-point. > > One of us should double check, or maybe Dave can speakup on this. > > cheers, > Yves >
You're right. I didn't look at the code here before writing. But there are places in regexec.c where the sentinels can be conflated with an input code point. I'm now thinking the patch would be better to be Show quoted text
> && !NEXTCHR_IS_EOS
instead of what you have: Show quoted text
> && nextchr >= 0 /* guard against negative EOS value in nextchr */
The macro is: #define NEXTCHR_IS_EOS (nextchr < 0) so it evaluates to the exact same thing as your patch, but uses the paradigm that was created by Dave for this situation, which is used in all other similar cases in this file.
Date: Thu, 27 Apr 2017 13:47:54 -0600
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
To: saint.snit [...] gmail.com, perlbug-followup [...] perl.org
From: Karl Williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 463b
On 04/27/2017 01:42 PM, saint.snit@gmail.com wrote: Show quoted text
>> If there were any doubt that this patch could have a secondary >> side-effect beside fixing this bug I would agree it should wait, and >> again, if this had be present in 5.24 I would say it could wait.
> > Not to advocate for or against the patch's inclusion, but the bug has > existed since at least 5.22.2, which is where I discovered it. >
Hmmm. I tried 5.24.1 before I said it was introduced in 5.25.
Date: Thu, 27 Apr 2017 17:26:30 -0500
From: saint.snit [...] gmail.com
To: perlbug-followup [...] perl.org
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
Download (untitled) / with headers
text/plain 406b
Show quoted text
> Hmmm. I tried 5.24.1 before I said it was introduced in 5.25.
Upthread, Zefram said it was introduced in v5.21.7-165-g613abc6. However, this was where the warning was introduced -- the underlying misinterpretation of EOS that the warning catches may have existed silently before that. (I suppose it's a philosophical question whether a coding error that produces no incorrect behavior is still a bug.)
From: Zefram <zefram [...] fysh.org>
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
To: perl5-porters [...] perl.org
Date: Thu, 27 Apr 2017 23:33:57 +0100
Download (untitled) / with headers
text/plain 360b
saint.snit@gmail.com wrote: Show quoted text
>However, this was where the warning was introduced -- the underlying >misinterpretation of EOS that the warning catches may have existed >silently before that.
No. The misinterpretation of EOS exists only in the code that checks whether to emit the warning, the code that was introduced in the commit that I identified. -zefram
To: Zefram <zefram [...] fysh.org>
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
Date: Fri, 28 Apr 2017 08:55:10 +0200
CC: Perl5 Porteros <perl5-porters [...] perl.org>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 500b
If it has been present prior to 5.25 then I will revert. Yves

On 28 Apr 2017 00:34, "Zefram" <zefram@fysh.org> wrote:
Show quoted text
saint.snit@gmail.com wrote:
>However, this was where the warning was introduced -- the underlying
>misinterpretation of EOS that the warning catches may have existed
>silently before that.

No.  The misinterpretation of EOS exists only in the code that checks
whether to emit the warning, the code that was introduced in the commit
that I identified.

-zefram
From: demerphq <demerphq [...] gmail.com>
CC: Perl5 Porteros <perl5-porters [...] perl.org>
Date: Fri, 28 Apr 2017 09:25:53 +0200
To: James E Keenan <jkeenan [...] pobox.com>, Sawyer X <xsawyerx [...] gmail.com>
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
Download (untitled) / with headers
text/plain 756b
On 28 April 2017 at 08:55, demerphq <demerphq@gmail.com> wrote: Show quoted text
> On 28 Apr 2017 00:34, "Zefram" <zefram@fysh.org> wrote:
>> >> saint.snit@gmail.com wrote:
>> >However, this was where the warning was introduced -- the underlying >> >misinterpretation of EOS that the warning catches may have existed >> >silently before that.
>> >> No. The misinterpretation of EOS exists only in the code that checks >> whether to emit the warning, the code that was introduced in the commit >> that I identified. >>
> If it has been present prior to 5.25 then I will revert.
Since this was present pre 5.25 I have reverted with b4972372a75776de3c9e6bd234a398d103677316 It still feels kinda wrong but whatever. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: demerphq <demerphq [...] gmail.com>, Sawyer X <xsawyerx [...] gmail.com>, Perl5 Porteros <perl5-porters [...] perl.org>, Zefram <zefram [...] fysh.org>
From: Dave Mitchell <davem [...] iabyn.com>
Date: Fri, 28 Apr 2017 09:44:22 +0100
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
To: Karl Williamson <public [...] khwilliamson.com>
On Thu, Apr 27, 2017 at 08:32:52PM +0200, demerphq wrote: Show quoted text
> I think I mispoke when I said that nexchr holds a codepoint, I think I > should have said it holds the next *octet*, or in other words my > understanding is nexchr is either negative, in which case it is -10, > otherwise it is restricted to 0..255.
Yep, at the top of the main loop is this: SET_nextchr; assert(nextchr < 256 && (nextchr >= 0 || nextchr == NEXTCHR_EOS)); On Thu, Apr 27, 2017 at 01:45:46PM -0600, Karl Williamson wrote: Show quoted text
> I'm now thinking the patch would be better to be >
> > && !NEXTCHR_IS_EOS
> > instead of what you have: >
> > && nextchr >= 0 /* guard against negative EOS value in nextchr */
> > The macro is: > > #define NEXTCHR_IS_EOS (nextchr < 0) > > so it evaluates to the exact same thing as your patch, but uses the paradigm > that was created by Dave for this situation, which is used in all other > similar cases in this file.
+1 -- Please note that ash-trays are provided for the use of smokers, whereas the floor is provided for the use of all patrons. -- Bill Royston
From: demerphq <demerphq [...] gmail.com>
CC: Karl Williamson <public [...] khwilliamson.com>, Sawyer X <xsawyerx [...] gmail.com>, Perl5 Porteros <perl5-porters [...] perl.org>, Zefram <zefram [...] fysh.org>
Date: Fri, 28 Apr 2017 10:57:11 +0200
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
To: Dave Mitchell <davem [...] iabyn.com>
Download (untitled) / with headers
text/plain 1.6k
On 28 April 2017 at 10:44, Dave Mitchell <davem@iabyn.com> wrote: Show quoted text
> On Thu, Apr 27, 2017 at 08:32:52PM +0200, demerphq wrote:
>> I think I mispoke when I said that nexchr holds a codepoint, I think I >> should have said it holds the next *octet*, or in other words my >> understanding is nexchr is either negative, in which case it is -10, >> otherwise it is restricted to 0..255.
> > Yep, at the top of the main loop is this: > > SET_nextchr; > assert(nextchr < 256 && (nextchr >= 0 || nextchr == NEXTCHR_EOS));
Thanks, so that is taken care of for now. Show quoted text
> > On Thu, Apr 27, 2017 at 01:45:46PM -0600, Karl Williamson wrote:
>> I'm now thinking the patch would be better to be >>
>> > && !NEXTCHR_IS_EOS
>> >> instead of what you have: >>
>> > && nextchr >= 0 /* guard against negative EOS value in nextchr */
>> >> The macro is: >> >> #define NEXTCHR_IS_EOS (nextchr < 0) >> >> so it evaluates to the exact same thing as your patch, but uses the paradigm >> that was created by Dave for this situation, which is used in all other >> similar cases in this file.
> > +1
For what its worth I dont mind doing this but I don't like it much. I would prefer that we create a define like #define NEXTCHR_IS_OCTET (nextchr>=0) and then use that. That way we can distinguish between "not end of string" and "is a octet" in our code, particularly if we add a new special value for nextchr. The other approach BTW would be to add a new UTF8 define that does not cast its argument, and use that instead. Or do both. :-) Anyway, since this bug was present prior to 5.25 I have reverted my patch entirely. We have until 5.26 is out to work out a better patch. cheers, Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
Subject: Re: [perl #131190] erroneous regex warning after utf8 conversion
To: demerphq <demerphq [...] gmail.com>, Dave Mitchell <davem [...] iabyn.com>
Date: Fri, 28 Apr 2017 14:32:29 +0200
From: Sawyer X <xsawyerx [...] gmail.com>
CC: Karl Williamson <public [...] khwilliamson.com>, Perl5 Porteros <perl5-porters [...] perl.org>, Zefram <zefram [...] fysh.org>
Download (untitled) / with headers
text/plain 1.7k
On 04/28/2017 10:57 AM, demerphq wrote: Show quoted text
> On 28 April 2017 at 10:44, Dave Mitchell <davem@iabyn.com> wrote:
>> On Thu, Apr 27, 2017 at 08:32:52PM +0200, demerphq wrote:
>>> I think I mispoke when I said that nexchr holds a codepoint, I think I >>> should have said it holds the next *octet*, or in other words my >>> understanding is nexchr is either negative, in which case it is -10, >>> otherwise it is restricted to 0..255.
>> Yep, at the top of the main loop is this: >> >> SET_nextchr; >> assert(nextchr < 256 && (nextchr >= 0 || nextchr == NEXTCHR_EOS));
> Thanks, so that is taken care of for now. >
>> On Thu, Apr 27, 2017 at 01:45:46PM -0600, Karl Williamson wrote:
>>> I'm now thinking the patch would be better to be >>>
>>>> && !NEXTCHR_IS_EOS
>>> instead of what you have: >>>
>>>> && nextchr >= 0 /* guard against negative EOS value in nextchr */
>>> The macro is: >>> >>> #define NEXTCHR_IS_EOS (nextchr < 0) >>> >>> so it evaluates to the exact same thing as your patch, but uses the paradigm >>> that was created by Dave for this situation, which is used in all other >>> similar cases in this file.
>> +1
> For what its worth I dont mind doing this but I don't like it much. I > would prefer that we create a define like > > #define NEXTCHR_IS_OCTET (nextchr>=0) > > and then use that. That way we can distinguish between "not end of > string" and "is a octet" in our code, particularly if we add a new > special value for nextchr. > > The other approach BTW would be to add a new UTF8 define that does not > cast its argument, and use that instead. > > Or do both. :-) > > Anyway, since this bug was present prior to 5.25 I have reverted my > patch entirely. We have until 5.26 is out to work out a better patch.
Thanks, everyone! :)


This service is sponsored and maintained by Best Practical Solutions and runs on Perl.org infrastructure.

For issues related to this RT instance (aka "perlbug"), please contact perlbug-admin at perl.org