Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent behaviour when decoding in substitution #15324

Closed
p5pRT opened this issue May 13, 2016 · 10 comments
Closed

Inconsistent behaviour when decoding in substitution #15324

p5pRT opened this issue May 13, 2016 · 10 comments

Comments

@p5pRT
Copy link

p5pRT commented May 13, 2016

Migrated from rt.perl.org#128143 (status was 'rejected')

Searchable as RT128143$

@p5pRT
Copy link
Author

p5pRT commented May 13, 2016

From mbu@jobindex.dk

Created by mbu@jobindex.dk

  $x =~ s/(.)/$latin1->decode($1)/ge
  $x =~ s/(.)/Encode​::decode('latin1', $1)/ge

These two expressions can in some cases give different results. See attached
example program for a specific example.

I'm fully aware that decoding an already decoded string is not a good idea.
With that said, I think the inconsistent behaviour shown in the example is
surprising at least. I would assume that the four versions would give the same
result (whatever that would be).

I have reproduced the issue on Perl 5.24.0, 5.22.1 and 5.18.2.

----------------- Example ---------------------------------------
use Encode;
use Devel​::Peek;

my $latin1 = find_encoding('latin1');

my $x = "\N{U+00C5}\N{U+0080}";

sub version1 {
  my $in = shift;
  return Encode​::decode('latin1', $in);
}

sub version2 {
  return Encode​::decode('latin1', $_[0]);
}

sub version3 {
  my $in = shift;
  return $latin1->decode($in);
}

sub version4 {
  return $latin1->decode($_[0]);
}

my $a = $x =~ s/(.)/version1($1)/ger; Dump($a); # "\xC5\x80"
my $b = $x =~ s/(.)/version2($1)/ger; Dump($b); # "\xC5\x80"
my $c = $x =~ s/(.)/version3($1)/ger; Dump($c); # "\xC5\x80"
my $d = $x =~ s/(.)/version4($1)/ger; Dump($d); # "\xC3\x85\xC2\x80"
----------------- Example end -----------------------------------

----------------- Example output --------------------------------
SV = PVMG(0x24f01f0) at 0x23ff2f0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x23f8bd0 "\303\205\302\200"\0 [UTF8 "\x{c5}\x{80}"]
  CUR = 4
  LEN = 10
SV = PVMG(0x24f0220) at 0x24f6440
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x2403de0 "\303\205\302\200"\0 [UTF8 "\x{c5}\x{80}"]
  CUR = 4
  LEN = 10
SV = PVMG(0x24f0280) at 0x24f62a8
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x2403e40 "\303\205\302\200"\0 [UTF8 "\x{c5}\x{80}"]
  CUR = 4
  LEN = 10
SV = PVMG(0x24f02b0) at 0x24f67a0
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x2427690 "\303\203\302\205\303\202\302\200"\0 [UTF8 "\x{c3}\x{85}\x{c2}\x{80}"]
  CUR = 8
  LEN = 10
  COW_REFCNT = 0
----------------- Example output end ----------------------------

Perl Info

Flags:
     category=library
     severity=low
     module=Encode

Site configuration information for perl 5.24.0:

Configured by mbu at Thu May 12 11:18:57 CEST 2016.

Summary of my perl5 (revision 5 version 24 subversion 0) configuration:
    
   Platform:
     osname=linux, osvers=3.13.0-85-generic, archname=x86_64-linux
     uname='linux mbu 3.13.0-85-generic #129-ubuntu smp thu mar 17 20:50:15 utc 2016 x86_64 x86_64 x86_64 gnulinux '
     config_args='-de -Dprefix=/home/mbu/perl5/perlbrew/perls/perl-5.24.0 -Aeval:scriptdir=/home/mbu/perl5/perlbrew/perls/perl-5.24.0/bin'
     hint=recommended, useposix=true, d_sigaction=define
     useithreads=undef, usemultiplicity=undef
     use64bitint=define, use64bitall=define, uselongdouble=undef
     usemymalloc=n, bincompat5005=undef
   Compiler:
     cc='cc', ccflags ='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
     optimize='-O2',
     cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
     ccversion='', gccversion='4.8.4', gccosandvers=''
     intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678, doublekind=3
     d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16, longdblkind=3
     ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
     alignbytes=8, prototype=define
   Linker and Libraries:
     ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
     libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/4.8/include-fixed 
/usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib 
/usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
     libs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
     perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
     libc=libc-2.19.so, so=so, useshrplib=false, libperl=libperl.a
     gnulibc_version='2.19'
   Dynamic Linking:
     dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
     cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'

Locally applied patches:
     Devel::PatchPerl 1.38


@INC for perl 5.24.0:
     /home/mbu/perl5/perlbrew/perls/perl-5.24.0/lib/site_perl/5.24.0/x86_64-linux
     /home/mbu/perl5/perlbrew/perls/perl-5.24.0/lib/site_perl/5.24.0
     /home/mbu/perl5/perlbrew/perls/perl-5.24.0/lib/5.24.0/x86_64-linux
     /home/mbu/perl5/perlbrew/perls/perl-5.24.0/lib/5.24.0
     .


Environment for perl 5.24.0:
     HOME=/home/mbu
     LANG=en_US.UTF-8
     LANGUAGE=en_AG:en
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     PATH=/home/mbu/perl5/perlbrew/bin:/home/mbu/perl5/perlbrew/perls/perl-5.24.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/home/mbu/usr/bin:/home/mbu/.fzf/bin:/home/mbu/.cargo/bin
     PERLBREW_BASHRC_VERSION=0.75
     PERLBREW_HOME=/home/mbu/.perlbrew
     PERLBREW_MANPATH=/home/mbu/perl5/perlbrew/perls/perl-5.24.0/man
     PERLBREW_PATH=/home/mbu/perl5/perlbrew/bin:/home/mbu/perl5/perlbrew/perls/perl-5.24.0/bin
     PERLBREW_PERL=perl-5.24.0
     PERLBREW_ROOT=/home/mbu/perl5/perlbrew
     PERLBREW_VERSION=0.75
     PERL_BADLANG (unset)
     SHELL=/bin/bash


@p5pRT
Copy link
Author

p5pRT commented May 14, 2016

From @iabyn

On Fri, May 13, 2016 at 01​:40​:57AM -0700, Michael Budde wrote​:

# New Ticket Created by Michael Budde
# Please include the string​: [perl #128143]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=128143 >

This is a bug report for perl from mbu@​jobindex.dk,
generated with the help of perlbug 1.40 running under perl 5.24.0.

-----------------------------------------------------------------
[Please describe your issue here]

 $x =~ s/\(\.\)/$latin1\->decode\($1\)/ge
 $x =~ s/\(\.\)/Encode&#8203;::decode\('latin1'\, $1\)/ge

These two expressions can in some cases give different results. See attached
example program for a specific example.

The difference is that $latin1->decode directly calls the correct method,
while with Encode​::decode('latin1'), the decode() sub has to run some
extra code to look up 'latin1' and find the correct class before calling
the real method. It so happens that the latter causes extra pattern
matches to be run, which changes the value of $1.

If you directly pass $1 as an arg by reference as perl subs do, rather
than making a copy or passing a stringified copy, then $1 only gets
its value at the point it is evaluated. For example​:

  sub f {
  "b" =~ /(.)/ or die;
  print "<$_[0]>\n";
  }
  "a" =~ /(.)/ or die;
  f($1); # prints <b>

$1 et al act a bit like tied variables, and you can see similar artifacts
with foo($tied_scalar) verses foo("$tied_scalar") for example.

--
Nothing ventured, nothing lost.

@p5pRT
Copy link
Author

p5pRT commented May 14, 2016

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Jun 8, 2016

From @dcollinsn

Dave, his attachment suggests that it's the other way around. Running his testcase shows me that the "odd one out" is the one corresponding to​:

  s/(.)/$latin1->decode($1)/ger;

Surely, anything that $latin1->decode($1) does to trample on $1, Encode​::decode('latin1', $1) will do as well? In fact, the first thing Encode​::decode does is break the pass-by-reference with​:

  my ( $name, $octets, $check ) = @​_;

So either $latin1->decode doesn't break the reference, and modifies $1 by assigning to it, or it performs a match that Encode​::decode('latin1', $1) doesn't, and modifies $1 through a regex match.

So, I ran a different testcase, Dumping $1 instead of the return from decode(), and all four testcases had only "\x80" in $1 - so it does not seem that $1 is being polluted by a pattern match in $latin1->decode(), nor does it seem that $1 is being changed by reference within $latin1->decode(), since each case should result in the attached modified test failing in some way.

I'll keep poking this, unless I'm missing something obvious.

@p5pRT
Copy link
Author

p5pRT commented Jun 8, 2016

From @dcollinsn

128143_test_2.pl

@p5pRT
Copy link
Author

p5pRT commented Jun 8, 2016

From zefram@fysh.org

Dan Collins via RT wrote​:

Surely, anything that $latin1->decode($1) does to trample on $1,
Encode​::decode('latin1', $1) will do as well?

No. $latin1 is an object of the Encode​::XS class, so the method
is Encode​::XS​::decode, which is implemented in XS quite separately
from the pure Perl Encode​::decode. The XS does not copy the input,
and one of the first things it does is to downgrade the input scalar.
Without even checking the SvREADONLY flag! Naughty.

-zefram

@p5pRT
Copy link
Author

p5pRT commented Jun 8, 2016

From @dcollinsn

Thanks, Zefram, that was a nudge in the right direction. I get it, and I think I fixed it. Let me write a test and I'll submit a patch tomorrow.

@p5pRT
Copy link
Author

p5pRT commented Jun 8, 2016

From [Unknown Contact. See original ticket]

Thanks, Zefram, that was a nudge in the right direction. I get it, and I think I fixed it. Let me write a test and I'll submit a patch tomorrow.

@p5pRT
Copy link
Author

p5pRT commented Jun 8, 2016

From @dcollinsn

Submitted to Encode's cpan maintainers as [cpan #115168].

@p5pRT
Copy link
Author

p5pRT commented Jul 18, 2016

@iabyn - Status changed from 'open' to 'rejected'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant