Inconsistent behaviour when decoding in substitution #15324

p5pRT · 2016-05-13T08:40:56Z

Migrated from rt.perl.org#128143 (status was 'rejected')

Searchable as RT128143$

p5pRT · 2016-05-13T08:40:56Z

From mbu@jobindex.dk

Created by mbu@jobindex.dk

$x =~ s/(.)/$latin1->decode($1)/ge
$x =~ s/(.)/Encode::decode('latin1', $1)/ge

These two expressions can in some cases give different results. See attached
example program for a specific example.

I'm fully aware that decoding an already decoded string is not a good idea.
With that said, I think the inconsistent behaviour shown in the example is
surprising at least. I would assume that the four versions would give the same
result (whatever that would be).

I have reproduced the issue on Perl 5.24.0, 5.22.1 and 5.18.2.

----------------- Example ---------------------------------------
use Encode;
use Devel::Peek;

my $latin1 = find_encoding('latin1');

my $x = "\N{U+00C5}\N{U+0080}";

sub version1 {
my $in = shift;
return Encode::decode('latin1', $in);
}

sub version2 {
return Encode::decode('latin1', $_[0]);
}

sub version3 {
my $in = shift;
return $latin1->decode($in);
}

sub version4 {
return $latin1->decode($_[0]);
}

my $a = $x =~ s/(.)/version1($1)/ger; Dump($a); # "\xC5\x80"
my $b = $x =~ s/(.)/version2($1)/ger; Dump($b); # "\xC5\x80"
my $c = $x =~ s/(.)/version3($1)/ger; Dump($c); # "\xC5\x80"
my $d = $x =~ s/(.)/version4($1)/ger; Dump($d); # "\xC3\x85\xC2\x80"
----------------- Example end -----------------------------------

----------------- Example output --------------------------------
SV = PVMG(0x24f01f0) at 0x23ff2f0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x23f8bd0 "\303\205\302\200"\0 [UTF8 "\x{c5}\x{80}"]
CUR = 4
LEN = 10
SV = PVMG(0x24f0220) at 0x24f6440
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x2403de0 "\303\205\302\200"\0 [UTF8 "\x{c5}\x{80}"]
CUR = 4
LEN = 10
SV = PVMG(0x24f0280) at 0x24f62a8
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x2403e40 "\303\205\302\200"\0 [UTF8 "\x{c5}\x{80}"]
CUR = 4
LEN = 10
SV = PVMG(0x24f02b0) at 0x24f67a0
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x2427690 "\303\203\302\205\303\202\302\200"\0 [UTF8 "\x{c3}\x{85}\x{c2}\x{80}"]
CUR = 8
LEN = 10
COW_REFCNT = 0
----------------- Example output end ----------------------------

Perl Info


Flags:
     category=library
     severity=low
     module=Encode

Site configuration information for perl 5.24.0:

Configured by mbu at Thu May 12 11:18:57 CEST 2016.

Summary of my perl5 (revision 5 version 24 subversion 0) configuration:
    
   Platform:
     osname=linux, osvers=3.13.0-85-generic, archname=x86_64-linux
     uname='linux mbu 3.13.0-85-generic #129-ubuntu smp thu mar 17 20:50:15 utc 2016 x86_64 x86_64 x86_64 gnulinux '
     config_args='-de -Dprefix=/home/mbu/perl5/perlbrew/perls/perl-5.24.0 -Aeval:scriptdir=/home/mbu/perl5/perlbrew/perls/perl-5.24.0/bin'
     hint=recommended, useposix=true, d_sigaction=define
     useithreads=undef, usemultiplicity=undef
     use64bitint=define, use64bitall=define, uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='cc', ccflags ='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
     optimize='-O2',
     cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
     ccversion='', gccversion='4.8.4', gccosandvers=''
     intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678, doublekind=3
     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16, longdblkind=3
     ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
     alignbytes=8, prototype=define
   Linker and Libraries:
     ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
     libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/4.8/include-fixed 
/usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib 
/usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
     libs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
     perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
     libc=libc-2.19.so, so=so, useshrplib=false, libperl=libperl.a
     gnulibc_version='2.19'
   Dynamic Linking:
     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
     cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'

Locally applied patches:
     Devel::PatchPerl 1.38


@INC for perl 5.24.0:
     /home/mbu/perl5/perlbrew/perls/perl-5.24.0/lib/site_perl/5.24.0/x86_64-linux
     /home/mbu/perl5/perlbrew/perls/perl-5.24.0/lib/site_perl/5.24.0
     /home/mbu/perl5/perlbrew/perls/perl-5.24.0/lib/5.24.0/x86_64-linux
     /home/mbu/perl5/perlbrew/perls/perl-5.24.0/lib/5.24.0
     .


Environment for perl 5.24.0:
     HOME=/home/mbu
     LANG=en_US.UTF-8
     LANGUAGE=en_AG:en
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     PATH=/home/mbu/perl5/perlbrew/bin:/home/mbu/perl5/perlbrew/perls/perl-5.24.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/home/mbu/usr/bin:/home/mbu/.fzf/bin:/home/mbu/.cargo/bin
     PERLBREW_BASHRC_VERSION=0.75
     PERLBREW_HOME=/home/mbu/.perlbrew
     PERLBREW_MANPATH=/home/mbu/perl5/perlbrew/perls/perl-5.24.0/man
     PERLBREW_PATH=/home/mbu/perl5/perlbrew/bin:/home/mbu/perl5/perlbrew/perls/perl-5.24.0/bin
     PERLBREW_PERL=perl-5.24.0
     PERLBREW_ROOT=/home/mbu/perl5/perlbrew
     PERLBREW_VERSION=0.75
     PERL_BADLANG (unset)
     SHELL=/bin/bash

p5pRT · 2016-05-14T21:03:15Z

From @iabyn

On Fri, May 13, 2016 at 01:40:57AM -0700, Michael Budde wrote:

# New Ticket Created by Michael Budde
# Please include the string: [perl #128143]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=128143 >

This is a bug report for perl from mbu@jobindex.dk,
generated with the help of perlbug 1.40 running under perl 5.24.0.

-----------------------------------------------------------------
[Please describe your issue here]
 $x =~ s/$\.$/$latin1\->decode$$1$/ge
 $x =~ s/$\.$/Encode&#8203;::decode$'latin1'\, $1$/ge
These two expressions can in some cases give different results. See attached
example program for a specific example.

The difference is that $latin1->decode directly calls the correct method,
while with Encode::decode('latin1'), the decode() sub has to run some
extra code to look up 'latin1' and find the correct class before calling
the real method. It so happens that the latter causes extra pattern
matches to be run, which changes the value of $1.

If you directly pass $1 as an arg by reference as perl subs do, rather
than making a copy or passing a stringified copy, then $1 only gets
its value at the point it is evaluated. For example:

sub f {
"b" =~ /(.)/ or die;
print "<$_[0]>\n";
}
"a" =~ /(.)/ or die;
f($1); # prints <b>

$1 et al act a bit like tied variables, and you can see similar artifacts
with foo($tied_scalar) verses foo("$tied_scalar") for example.

--
Nothing ventured, nothing lost.

p5pRT · 2016-05-14T21:03:15Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2016-06-08T02:54:49Z

From @dcollinsn

Dave, his attachment suggests that it's the other way around. Running his testcase shows me that the "odd one out" is the one corresponding to:

s/(.)/$latin1->decode($1)/ger;

Surely, anything that $latin1->decode($1) does to trample on $1, Encode::decode('latin1', $1) will do as well? In fact, the first thing Encode::decode does is break the pass-by-reference with:

my ( $name, $octets, $check ) = @_;

So either $latin1->decode doesn't break the reference, and modifies $1 by assigning to it, or it performs a match that Encode::decode('latin1', $1) doesn't, and modifies $1 through a regex match.

So, I ran a different testcase, Dumping $1 instead of the return from decode(), and all four testcases had only "\x80" in $1 - so it does not seem that $1 is being polluted by a pattern match in $latin1->decode(), nor does it seem that $1 is being changed by reference within $latin1->decode(), since each case should result in the attached modified test failing in some way.

I'll keep poking this, unless I'm missing something obvious.

p5pRT · 2016-06-08T02:54:49Z

From @dcollinsn

128143_test_2.pl

p5pRT · 2016-06-08T03:39:43Z

From zefram@fysh.org

Dan Collins via RT wrote:

Surely, anything that $latin1->decode($1) does to trample on $1,
Encode::decode('latin1', $1) will do as well?

No. $latin1 is an object of the Encode::XS class, so the method
is Encode::XS::decode, which is implemented in XS quite separately
from the pure Perl Encode::decode. The XS does not copy the input,
and one of the first things it does is to downgrade the input scalar.
Without even checking the SvREADONLY flag! Naughty.

-zefram

p5pRT · 2016-06-08T04:34:49Z

From @dcollinsn

Thanks, Zefram, that was a nudge in the right direction. I get it, and I think I fixed it. Let me write a test and I'll submit a patch tomorrow.

p5pRT · 2016-06-08T04:34:49Z

From [Unknown Contact. See original ticket]

Thanks, Zefram, that was a nudge in the right direction. I get it, and I think I fixed it. Let me write a test and I'll submit a patch tomorrow.

p5pRT · 2016-06-08T14:25:09Z

From @dcollinsn

Submitted to Encode's cpan maintainers as [cpan #115168].

p5pRT · 2016-07-18T13:32:21Z

@iabyn - Status changed from 'open' to 'rejected'

p5pRT closed this as completed Jul 18, 2016

p5pRT added Severity Low distro-Linux sendToCPAN labels Oct 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent behaviour when decoding in substitution #15324

Inconsistent behaviour when decoding in substitution #15324

p5pRT commented May 13, 2016

p5pRT commented May 13, 2016

p5pRT commented May 14, 2016

p5pRT commented May 14, 2016

p5pRT commented Jun 8, 2016

p5pRT commented Jun 8, 2016

p5pRT commented Jun 8, 2016

p5pRT commented Jun 8, 2016

p5pRT commented Jun 8, 2016

p5pRT commented Jun 8, 2016

p5pRT commented Jul 18, 2016

Inconsistent behaviour when decoding in substitution #15324

Inconsistent behaviour when decoding in substitution #15324

Comments

p5pRT commented May 13, 2016

p5pRT commented May 13, 2016

From mbu@jobindex.dk

Created by mbu@jobindex.dk

p5pRT commented May 14, 2016

From @iabyn

p5pRT commented May 14, 2016

p5pRT commented Jun 8, 2016

From @dcollinsn

p5pRT commented Jun 8, 2016

From @dcollinsn

p5pRT commented Jun 8, 2016

From zefram@fysh.org

p5pRT commented Jun 8, 2016

From @dcollinsn

p5pRT commented Jun 8, 2016

From [Unknown Contact. See original ticket]

p5pRT commented Jun 8, 2016

From @dcollinsn

p5pRT commented Jul 18, 2016