crash with unicode characters in regex comment #8656

Closed
p5pRT opened this issue Nov 1, 2006 · 22 comments

Comments


p5pRT commented Nov 1, 2006

Migrated from rt.perl.org#40641 (status was 'resolved')

Searchable as RT40641$


p5pRT commented Nov 1, 2006

From h.f.s.@web.de

This is a bug report for perl from h.f.s.@​web.de,
generated with the help of perlbug 1.35 running under perl v5.8.8.


Hi,

perl crashes when I use an umlaut in a comment to a regex in a
Parse​::RecDescent grammar and use encoding 'utf8';. I wasn't able to
find out whose fault it is, because the bug is not triggered when I
remove either the use line, one line of the regex, or the 'ä' in the
comment.

This is the smallest example I was able to create:

  #!/usr/bin/perl
  use encoding 'utf8';
  use Parse::RecDescent;

  my $parser = testParser();

  sub testParser {
      return new Parse::RecDescent( q(

          test : /\.               # Gebäude
                  (Test)?
                 /x
      ));
  }

The version of Parse​::RecDescent is 1.94 from CPAN.

The error message is​:
*** glibc detected *** double free or corruption (!prev)​: 0x08393180
***

The stacktrace is​:
Program received signal SIGABRT, Aborted.
[Switching to Thread -1209919264 (LWP 13199)]
0xb7e78947 in raise () from /lib/tls/libc.so.6
(gdb) bt
#0 0xb7e78947 in raise () from /lib/tls/libc.so.6
#1 0xb7e7a0c9 in abort () from /lib/tls/libc.so.6
#2 0xb7eadfda in __fsetlocking () from /lib/tls/libc.so.6
#3 0xb7eb589f in mallopt () from /lib/tls/libc.so.6
#4 0xb7eb5942 in free () from /lib/tls/libc.so.6
#5 0x0809b517 in Perl_pregfree ()
#6 0x0808a188 in Perl_op_clear ()
#7 0x0808ccac in Perl_op_free ()
#8 0x0808cc6f in Perl_op_free ()
#9 0x0808cc6f in Perl_op_free ()
#10 0x080e8e82 in Perl_leave_scope ()
#11 0x080e93ac in Perl_pop_scope ()
#12 0x080eb4d1 in Perl_pp_leaveeval ()
#13 0x080bc3b9 in Perl_runops_standard ()
#14 0x08063bfd in perl_run ()
#15 0x0805ffd1 in main ()

This was with my system perl version. Continue below...



Flags​:
  category=core
  severity=low


Site configuration information for perl v5.8.8​:

Configured by Debian Project at Sun Aug 6 15​:47​:28 UTC 2006.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration​:
  Platform​:
  osname=linux, osvers=2.6.15-1-686,
archname=i486-linux-gnu-thread-multi
  uname='linux ulises 2.6.15-1-686 #2 mon mar 6 15​:27​:08 utc 2006
i686 gnulinux '
 
config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.8 -Dsitearch=/usr/local/lib/perl/5.8.8 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.8 -Dd_dosuid -des'
  hint=recommended, useposix=true, d_sigaction=define
  usethreads=define use5005threads=undef useithreads=define
usemultiplicity=define
  useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
  use64bitint=undef use64bitall=undef uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags
='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O2',
 
cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
  ccversion='', gccversion='4.1.2 20060729 (prerelease) (Debian
4.1.1-10)', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define,
longdblsize=12
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags =' -L/usr/local/lib'
  libpth=/usr/local/lib /lib /usr/lib
  libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
  perllibs=-ldl -lm -lpthread -lc -lcrypt
  libc=/lib/libc-2.3.6.so, so=so, useshrplib=true,
libperl=libperl.so.5.8.8
  gnulibc_version='2.3.6'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches​:
 


@​INC for perl v5.8.8​:
  /etc/perl
  /usr/local/lib/perl/5.8.8
  /usr/local/share/perl/5.8.8
  /usr/lib/perl5
  /usr/share/perl5
  /usr/lib/perl/5.8
  /usr/share/perl/5.8
  /usr/local/lib/site_perl
  /usr/local/share/perl/5.8.7
  /usr/local/share/perl/5.8.4
  .


Environment for perl v5.8.8​:
  HOME=/home-stueck/hermi
  LANG=de_DE.UTF-8
  LANGUAGE=de_DE​:de​:en_GB​:en
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
 
PATH=/usr/local/bin​:/usr/bin​:/bin​:/usr/bin/X11​:/usr/games​:/home-stueck/hermi/bin
  PERL_BADLANG (unset)
  SHELL=/bin/bash


I locally compiled perl v5.9.4 and was able to reproduce the problem
with the same sample program.

The stacktrace​:
(gdb) set args /tmp/minimal.pl
(gdb) run
Starting program​: /tmp/perl-5.9.4/bin/perl5.9.4 /tmp/minimal.pl

Program received signal SIGSEGV, Segmentation fault.
0xb7d9b50f in mallopt () from /lib/tls/libc.so.6
(gdb) bt
#0 0xb7d9b50f in mallopt () from /lib/tls/libc.so.6
#1 0xb7d9b942 in free () from /lib/tls/libc.so.6
#2 0x0808dc58 in Perl_safesysfree (where=0x8445350) at util.c​:250
#3 0x080f4ecf in Perl_sv_clear (sv=0x8459ac8) at sv.c​:5178
#4 0x080f5131 in Perl_sv_free2 (sv=0x8459ac8) at sv.c​:5281
#5 0x08137ee9 in Perl_leave_scope (base=228) at scope.c​:742
#6 0x08134165 in Perl_pop_scope () at scope.c​:99
#7 0x08220b83 in Perl_yyparse () at perly.c​:717
#8 0x0814e825 in S_doeval (gimme=0, startop=0x0, outside=0x83a8a70,
seq=1415) at pp_ctl.c​:2937
#9 0x08152e15 in Perl_pp_entereval () at pp_ctl.c​:3501
#10 0x0808d20d in Perl_runops_debug () at dump.c​:1907
#11 0x080bb726 in S_run_body (oldscope=1) at perl.c​:2391
#12 0x080bad02 in perl_run (my_perl=0x8283008) at perl.c​:2311
#13 0x0805e8e1 in main (argc=2, argv=0xbfa38964, env=0xbfa38970) at
perlmain.c​:113

Regards,
Hermann Schwarting



Site configuration information for perl 5.9.4​:

Configured by hermi at Wed Nov 1 18​:16​:47 CET 2006.

Summary of my perl5 (revision 5 version 9 subversion 4) configuration​:
  Platform​:
  osname=linux, osvers=2.6.17, archname=i686-linux
  uname='linux stueck 2.6.17-custom.1 #1 preempt tue jun 27 22​:09​:46
cest 2006 i686 gnulinux '
  config_args=''
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=undef, usemultiplicity=undef
  useperlio=define, d_sfio=undef, uselargefiles=define,
usesocks=undef
  use64bitint=undef, use64bitall=undef, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags
='-DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-g',
 
cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include'
  ccversion='', gccversion='4.1.2 20060901 (prerelease) (Debian
4.1.1-13)', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define,
longdblsize=12
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags =' -L/usr/local/lib'
  libpth=/usr/local/lib /lib /usr/lib
  libs=-lnsl -ldb -ldl -lm -lcrypt -lutil -lc
  perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
  libc=/lib/libc-2.3.6.so, so=so, useshrplib=false,
libperl=libperl.a
  gnulibc_version='2.3.6'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches​:


@​INC for perl 5.9.4​:
  /usr/share/perl5
  /usr/local/share/perl/5.8.8
  /tmp/perl-5.9.4/lib/5.9.4/i686-linux
  /tmp/perl-5.9.4/lib/5.9.4
  /tmp/perl-5.9.4/lib/site_perl/5.9.4/i686-linux
  /tmp/perl-5.9.4/lib/site_perl/5.9.4
  .


Environment for perl 5.9.4​:
  HOME=/home-stueck/hermi
  LANG=de_DE.UTF-8
  LANGUAGE=de_DE​:de​:en_GB​:en
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
 
PATH=/usr/local/bin​:/usr/bin​:/bin​:/usr/bin/X11​:/usr/games​:/home-stueck/hermi/bin
  PERL5LIB=/usr/share/perl5​:/usr/local/share/perl/5.8.8
  PERL_BADLANG (unset)
  SHELL=/bin/bash



p5pRT commented Nov 1, 2006

From guest@guest.guest.xxxxxxxx

I investigated a bit further. With both perl 5.8.8 and 5.9.4
(configuration as above) this works​:
  perl -e 'eval "m/ää/;"'
while this doesn't​:
  perl -u -e 'eval "m/ää/;"'

I guess the problems are related.

The backtrace is​:
(gdb) set args -u -e 'eval "m/ää/;"'
(gdb) run
Starting program​: /tmp/perl-5.9.4/bin/perl5.9.4 -u -e 'eval "m/ää/;"'

Program received signal SIGABRT, Aborted.
0xb7dccd51 in kill () from /lib/tls/libc.so.6
(gdb) bt
#0 0xb7dccd51 in kill () from /lib/tls/libc.so.6
#1 0x080bf744 in Perl_my_unexec () at perl.c​:3440
#2 0x080bab3f in S_parse_body (env=0x0, xsinit=0x805e909 <xs_init>)
at perl.c​:2260
#3 0x080b8de0 in perl_parse (my_perl=0x8283008, xsinit=0x805e909
<xs_init>, argc=4, argv=0xbfc132b4, env=0x0)
  at perl.c​:1607
#4 0x0805e8cb in main (argc=4, argv=0xbfc132b4, env=0xbfc132c8) at
perlmain.c​:111


p5pRT commented Nov 1, 2006

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented Nov 1, 2006

From guest@guest.guest.xxxxxxxx

Whoops,

I must have things seriously mixed up to think that -u had anything to
do with Unicode. What I meant to say is

  perl -e 'eval "m/ää/;"'
works while
  perl -e 'use encoding "utf8"; eval "m/ää/;"'
doesn't. Backtrace​:
(gdb) set args -e 'use encoding "utf8"; eval "m/ää/;"'
(gdb) run
Starting program​: /tmp/perl-5.9.4/bin/perl5.9.4 -e 'use
encoding "utf8"; eval "m/ää/;"'
*** glibc detected *** free()​: invalid next size (fast)​: 0x082f1b30
***

Program received signal SIGABRT, Aborted.
0xb7dcb947 in raise () from /lib/tls/libc.so.6
(gdb) bt
#0 0xb7dcb947 in raise () from /lib/tls/libc.so.6
#1 0xb7dcd0c9 in abort () from /lib/tls/libc.so.6
#2 0xb7e00fda in __fsetlocking () from /lib/tls/libc.so.6
#3 0xb7e0889f in mallopt () from /lib/tls/libc.so.6
#4 0xb7e08942 in free () from /lib/tls/libc.so.6
#5 0x0808dc58 in Perl_safesysfree (where=0x82f1b30) at util.c​:250
#6 0x0823d867 in Perl_pregfree (r=0x828b638) at regcomp.c​:6707
#7 0x0805f246 in Perl_op_clear (o=0x828b5d8) at op.c​:471
#8 0x0805eed3 in Perl_op_free (o=0x828b5d8) at op.c​:329
#9 0x0805ee88 in Perl_op_free (o=0x829ebb8) at op.c​:318
#10 0x0805ee88 in Perl_op_free (o=0x829ec10) at op.c​:318
#11 0x08137fcc in Perl_leave_scope (base=3) at scope.c​:751
#12 0x08134165 in Perl_pop_scope () at scope.c​:99
#13 0x0815386d in Perl_pp_leaveeval () at pp_ctl.c​:3570
#14 0x0808d20d in Perl_runops_debug () at dump.c​:1907
#15 0x080bb726 in S_run_body (oldscope=1) at perl.c​:2391
#16 0x080bad02 in perl_run (my_perl=0x8283008) at perl.c​:2311
#17 0x0805e8e1 in main (argc=3, argv=0xbfd5d444, env=0xbfd5d454) at
perlmain.c​:113

Please excuse the noise.


p5pRT commented Nov 2, 2006

From @andk

On Wed, 01 Nov 2006 12​:40​:40 -0800, Hermann Schwarting (via RT) <perlbug-followup@​perl.org> said​:

  > use encoding 'utf8';

There are known bugs in C< use encoding 'utf8'; > which usually go
away when you do a C< use utf8; > instead.

That said, I cannot reproduce your bug with any of 10 different
bleedperls. As my machine is rather similar to yours I append my
config below. The only outstanding difference is that I have
use64bitint=define.

--
andreas

Summary of my perl5 (revision 5 version 9 subversion 5) configuration​:
  Platform​:
  osname=linux, osvers=2.6.17-2-k7, archname=i686-linux-64int
  uname='linux k75 2.6.17-2-k7 #1 smp thu aug 31 13​:27​:53 utc 2006 i686 gnulinux '
  config_args='-Dprefix=/home/src/perl/repoperls/installed-perls/perl/pCbaqkQ/perl-5.8.0@​29183 -Dinstallusrbinperl=n -Uversiononly -Doptimize=-g -des -Duse64bitint -Dusedevel'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=undef, usemultiplicity=undef
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=define, use64bitall=undef, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-g',
  cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -I/usr/local/include'
  ccversion='', gccversion='4.1.2 20060901 (prerelease) (Debian 4.1.1-13)', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags =' -L/usr/local/lib'
  libpth=/usr/local/lib /lib /usr/lib
  libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
  perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
  libc=/lib/libc-2.3.6.so, so=so, useshrplib=false, libperl=libperl.a
  gnulibc_version='2.3.6'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Characteristics of this binary (from libperl)​:
  Compile-time options​: DEBUGGING PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP
  USE_64_BIT_INT USE_LARGE_FILES USE_PERLIO
  Locally applied patches​:
  DEVEL
  patchaperlup​: --branch='perl' --upto='29183' --start='17639'
  Built under linux
  Compiled at Nov 1 2006 20​:08​:58
  @​INC​:
  /home/src/perl/repoperls/installed-perls/perl/pCbaqkQ/perl-5.8.0@​29183/lib/5.9.5/i686-linux-64int
  /home/src/perl/repoperls/installed-perls/perl/pCbaqkQ/perl-5.8.0@​29183/lib/5.9.5
  /home/src/perl/repoperls/installed-perls/perl/pCbaqkQ/perl-5.8.0@​29183/lib/site_perl/5.9.5/i686-linux-64int
  /home/src/perl/repoperls/installed-perls/perl/pCbaqkQ/perl-5.8.0@​29183/lib/site_perl/5.9.5
  .


p5pRT commented Nov 2, 2006

From @demerphq

On 11/2/06, Andreas J. Koenig <andreas.koenig.gmwojprw@​franz.ak.mind.de> wrote​:

On Wed, 01 Nov 2006 12​:40​:40 -0800, Hermann Schwarting (via RT) <perlbug-followup@​perl.org> said​:

use encoding 'utf8';

There are known bugs in C< use encoding 'utf8'; > which usually go
away when you do a C< use utf8; > instead.

That said, I cannot reproduce your bug with any of 10 different
bleedperls. As my machine is rather similar to yours I append my
config below. The only outstanding difference is that I have
use64bitint=define.

This rings a bell for me; didn't we have an issue related to this
recently where the encoding wasn't cascaded into an eval?

Note the backtrace: the error is NOT being emitted by the regex
engine; this is happening in the lexing of the eval.

cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"


p5pRT commented Nov 2, 2006

From BQW10602@nifty.com

On Thu, 02 Nov 2006 03​:45​:34 +0100, (Andreas J. Koenig) wrote

On Wed, 01 Nov 2006 12​:40​:40 -0800, Hermann Schwarting (via RT) said​:

use encoding 'utf8';

There are known bugs in C< use encoding 'utf8'; > which usually go
away when you do a C< use utf8; > instead.

That said, I cannot reproduce your bug with any of 10 different
bleedperls. As my machine is rather similar to yours I append my
config below. The only outstanding difference is that I have
use64bitint=define.

The following causes a crash on perl 5.8.1 and later
(including perl-current):

#!perl
use encoding 'utf8';
my $re = qr/a # Gebääääääääääääude
  /;
__END__

... and I have often (but not always) encountered such a message​:
Free to wrong pool 272840 not 34 during global destruction.

Regards,
SADAHIRO Tomoyuki


p5pRT commented Nov 2, 2006

From BQW10602@nifty.com

On Thu, 2 Nov 2006 09​:23​:53 +0100, demerphq wrote

This rings a bell for me; didn't we have an issue related to this
recently where the encoding wasn't cascaded into an eval?

Note the backtrace: the error is NOT being emitted by the regex
engine; this is happening in the lexing of the eval.

I think a chunk (see below) at the bottom of S_regatom()
seems problematic.

6381: if (ret && PL_encoding && PL_regkind[OP(ret)] == EXACT) {
6382:     const STRLEN oldlen = STR_LEN(ret);
6383:     SV * const sv = sv_2mortal(newSVpvn(STRING(ret), oldlen));
6384:
6385:     if (RExC_utf8)
6386:         SvUTF8_on(sv);
6387:     if (sv_utf8_downgrade(sv, TRUE)) {
6388:         const char * const s = sv_recode_to_utf8(sv, PL_encoding);
6389:         const STRLEN newlen = SvCUR(sv);

ä encoded in 'utf8' is downgraded by sv_utf8_downgrade() at line 6387,
and sv_recode_to_utf8() then recodes the downgraded ä, which is now in the
native encoding (ISO-8859-1), although PL_encoding expects its input to be
in 'utf8'.
As this ISO-8859-1 string is of course ill-formed as 'utf8',
the result in s at 6388 is REPLACEMENT CHARACTER (U+FFFD).

I don't see what this chunk is for.
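
The byte-level effect described above can be sketched in C. This is only an illustration under stated assumptions, not perl's code: decode_first_utf8() and utf8_encoded_length() are hypothetical helpers, and the decoder is simplified (it does not reject overlong or surrogate forms).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHARACTER 0xFFFDu

/* Hypothetical helper: decode the first character of s as UTF-8.
 * Returns U+FFFD for ill-formed or truncated input.
 * Simplified: overlong and surrogate encodings are not rejected. */
uint32_t decode_first_utf8(const unsigned char *s, size_t len)
{
    size_t need, i;
    uint32_t cp;

    assert(s != NULL);
    if (len == 0)
        return REPLACEMENT_CHARACTER;
    if (s[0] < 0x80)
        return s[0];                       /* plain ASCII */
    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        need = 1; cp = s[0] & 0x1Fu;       /* lead byte of a 2-byte form */
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        need = 2; cp = s[0] & 0x0Fu;       /* lead byte of a 3-byte form */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        need = 3; cp = s[0] & 0x07u;       /* lead byte of a 4-byte form */
    } else {
        return REPLACEMENT_CHARACTER;      /* stray continuation or invalid lead */
    }
    if (len < need + 1)
        return REPLACEMENT_CHARACTER;      /* sequence is truncated */
    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return REPLACEMENT_CHARACTER;  /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3Fu);
    }
    return cp;
}

/* Number of bytes needed to write a code point as UTF-8. */
size_t utf8_encoded_length(uint32_t cp)
{
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}
```

With the two bytes C3 A4 (ä in UTF-8) the decoder returns U+00E4. The single downgraded byte E4 looks like the lead byte of a three-byte sequence with nothing after it, so it comes back as U+FFFD, which needs three bytes when written out as UTF-8.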

Regards,
SADAHIRO Tomoyuki


p5pRT commented Nov 2, 2006

From BQW10602@nifty.com

On Thu, 02 Nov 2006 22​:55​:23 +0900, SADAHIRO Tomoyuki wrote

I think a chunk (see below) at the bottom of S_regatom()
seems problematic.

6381​: if (ret && PL_encoding && PL_regkind[OP(ret)] == EXACT) {
6382​: const STRLEN oldlen = STR_LEN(ret);
6383​: SV * const sv = sv_2mortal(newSVpvn(STRING(ret), oldlen));
6384​:
6385​: if (RExC_utf8)
6386​: SvUTF8_on(sv);
6387​: if (sv_utf8_downgrade(sv, TRUE)) {
6388​: const char * const s = sv_recode_to_utf8(sv, PL_encoding);
6389​: const STRLEN newlen = SvCUR(sv);

ä encoded in 'utf8' is downgraded by sv_utf8_downgrade() at line 6387,
and then sv_recode_to_utf8() recodes the downgraded ä in the native
encoding (ISO-8859-1), but PL_encoding expects the input must be
in 'utf8'.
As this ISO-8859-1 string is of course ill-formed as 'utf8',
the result in s at 6388 is REPLACEMENT CHARACTER (U+FFFD).

I don't see what this chunk is for.

I think the example below should clarify the circumstances.

#!perl
use charnames qw(:full);
$a = "\N{LATIN SMALL LETTER A WITH DIAERESIS}";

use encoding 'utf8';
utf8::downgrade($a);
utf8::upgrade($a);

use Devel::Peek;
Dump($a); # now U+FFFD!
__END__

While sv_utf8_upgrade() is encoding-aware, sv_utf8_downgrade() is not.
Therefore the downgrade is not the inverse of the upgrade under the
encoding pragma. I believe any sv_utf8_downgrade under the encoding
pragma makes no sense.

Regards,
SADAHIRO Tomoyuki


p5pRT commented Nov 2, 2006

From shouldbedomo@mac.com

On 2006–11–02, at 13​:57, SADAHIRO Tomoyuki wrote​:

The following causes crash on perl 5.8.1 and later
(including perl-current),

#!perl
use encoding 'utf8';
my $re = qr/a # Gebääääääääääääude
/;
__END__

That's interesting​: the original bug report was using an extended
regex, and your example does not, so they may not be tickling the
same problem​:

On 2006–11–01, at 21​:40, Hermann Schwarting (via RT) wrote​:

    test : /\.               # Gebäude
            (Test)?
           /x

Firstly, with 32-bit debugging bleadperl@​29183 on Mac OS X, Hermann's
script does not crash for me whether the regex is extended or not​:

32-bit_perl-current$ ./perl -Ilib -I/usr/local/lib/perl5/site_perl/5.8.8/ -w
  use encoding 'utf8';
  use Parse​::RecDescent;

  my $parser = testParser();

  sub testParser {
  return new Parse​::RecDescent( q(

  test : /\. # Gebäude
  (Test)?
  /x
  ));
  }
32-bit_perl-current$ ./perl -Ilib -I/usr/local/lib/perl5/site_perl/5.8.8/ -w
  use encoding 'utf8';
  use Parse​::RecDescent;

  my $parser = testParser();

  sub testParser {
  return new Parse​::RecDescent( q(

  test : /\. # Gebäude
  (Test)?
  /
  ));
  }
32-bit_perl-current$

This may be due to architectural differences between Mac OS X on PPC
and Linux on i486.

Secondly, making your regex extended stops perl misbehaving for me,
bolstering my suspicion that you've uncovered a different bug from
Hermann​:

32-bit_perl-current$ ./perl -w -Ilib
use encoding 'utf8';
my $re = qr/a # Gebääääääääääääude
  /x;
32-bit_perl-current$ ./perl -w -Ilib
use encoding 'utf8';
my $re = qr/a # Gebääääääääääääude
  /;
perl(6787) malloc​: *** error for object 0x1306ad0​: incorrect checksum
for freed object - object was probably modified after being freed,
break at szone_error to debug
perl(6787) malloc​: *** set a breakpoint in szone_error to debug
32-bit_perl-current$ gdb ./perl
GNU gdb 6.3.50-20050815 (Apple version gdb-573) (Fri Oct 20 15​:54​:33
GMT 2006)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and
you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for
details.
This GDB was configured as "powerpc-apple-darwin"...Reading symbols
for shared libraries .. done

(gdb) b szone_error
Breakpoint 1 at 0x90114024
(gdb) r -Ilib -w
Starting program​: /Volumes/Tottie/Other/src/Perlsmoke/32-bit_perl-
current/perl -Ilib -w
Reading symbols for shared libraries . done
use encoding 'utf8';
my $re = qr/a # Gebääääääääääääude
  /Reading symbols for shared libraries .. done
Reading symbols for shared libraries . done
  /;

Breakpoint 1, 0x90114024 in szone_error ()
(gdb) where
#0 0x90114024 in szone_error ()
#1 0x90003530 in szone_malloc ()
#2 0x90002e00 in malloc ()
#3 0x0000a6bc in Perl_safesysmalloc (size=60) at util.c​:92
#4 0x001064b8 in Perl_pregcomp (my_perl=0x1800400, exp=0x1106020
"a # Gebääääääääääääude\n", ' ' <repeats 11 times>,
xend=0x1106052 "", pm=0x115c7f0) at regcomp.c​:4020
#5 0x0001b0d0 in Perl_pmruntime (my_perl=0x1800400, o=0x115c7f0,
expr=0x1106060, isreg=112 'p') at op.c​:3228
#6 0x0015a5b8 in Perl_yyparse (my_perl=0x1800400) at perly.y​:730
#7 0x00030828 in S_parse_body (my_perl=0x1800400, env=0x0,
xsinit=0x24fc <xs_init>) at perl.c​:2255
#8 0x0002f3fc in perl_parse (my_perl=0x1800400, xsinit=0x24fc
<xs_init>, argc=3, argv=0xbffff938, env=0x0) at perl.c​:1616
#9 0x00002434 in main (argc=3, argv=0xbffff938, env=0xbffff948) at
perlmain.c​:111
(gdb) quit

I hope that the gdb output means something to somebody. Please let me
know if you want me to poke around and tell you the value of any
variables.
--
Dominic Dunlop


p5pRT commented Nov 3, 2006

From BQW10602@nifty.com

On Thu, 2 Nov 2006 18​:24​:54 +0100, Dominic Dunlop wrote

On 2006–11–02, at 13​:57, SADAHIRO Tomoyuki wrote​:

The following causes crash on perl 5.8.1 and later
(including perl-current),

#!perl
use encoding 'utf8';
my $re = qr/a # Gebääääääääääääude
/;
__END__

That's interesting​: the original bug report was using an extended
regex, and your example does not, so they may not be tickling the
same problem​:

Secondly, making your regex extended stops perl misbehaving for me,
bolstering my suspicion that you've uncovered a different bug from
Hermann​:

No, I don't think it is different. Actually Hermann's code
causes calling sv_utf8_downgrade from the bottom of S_regatom().

I think another bug in Parse::RecDescent is involved in this.
For example, try this, where [c-a] in the comment would be invalid
if it were part of the regex:

#!perl
use Parse::RecDescent;
my $parser = testParser();
sub testParser {
  return new Parse::RecDescent( q(
    test : /\. # [c-a]
            (Test)?
           /x
  ));
}
__END__

This causes weird messages​:

Variable "$errortext" is not available at Parse/RecDescent.pm line 2917.
Variable "$errorprefix" is not available at Parse/RecDescent.pm line 2917.
Use of uninitialized value $errorprefix in formline at Parse/RecDescent.pm line 2850.
Use of uninitialized value $errortext in formline at Parse/RecDescent.pm line 2850.
Use of uninitialized value $errortext in formline at Parse/RecDescent.pm line 2852.

The reason is that the invalid pattern in the comment is
wrongly compiled, the error is caught by eval, but the *buggy*
Parse::RecDescent::_warn() fails to report the warning message.

Parse::RecDescent 1082-1091
  if (!eval "no strict;
             local \$SIG{__WARN__} = sub {0};
             '' =~ m$ldel$pattern$rdel" and $@)
  {
      Parse::RecDescent::_warn(3, "Token pattern \"m$ldel$pattern$rdel\"
                                   may not be a valid regular expression",
                               $_[5]);
      $@ =~ s/ at \(eval.*/./;
      Parse::RecDescent::_hint($@);
  }

  '' =~ m$ldel$pattern$rdel" and $@​)

If Parse::RecDescent::_warn worked properly, the warning would look
like this:

Token pattern "m/\. # [c-a]
  (Test)?
  /" may not be a valid regular expression

The modifiers are wrongly omitted from the evalled code.

Inline Patch
diff -urN Parse-RecDescent-1.94/lib/Parse/RecDescent.pm Parse-RecDescent-new/lib/Parse/RecDescent.pm
--- Parse-RecDescent-1.94/lib/Parse/RecDescent.pm	Wed Apr 09 17:29:37 2003
+++ Parse-RecDescent-new/lib/Parse/RecDescent.pm	Fri Nov 03 12:27:33 2006
@@ -1081,9 +1081,9 @@
 
 	if (!eval "no strict;
 		   local \$SIG{__WARN__} = sub {0};
-		   '' =~ m$ldel$pattern$rdel" and $@)
+		   '' =~ m$ldel$pattern$rdel$mod" and $@)
 	{
-		Parse::RecDescent::_warn(3, "Token pattern \"m$ldel$pattern$rdel\"
+		Parse::RecDescent::_warn(3, "Token pattern \"m$ldel$pattern$rdel$mod\"
 					     may not be a valid regular expression",
 					   $_[5]);
 		$@ =~ s/ at \(eval.*/./;

Regards,
SADAHIRO Tomoyuki


p5pRT commented Nov 3, 2006

From @andk

(TheDamian added to CC)

On Fri, 03 Nov 2006 12​:35​:01 +0900, SADAHIRO Tomoyuki <bqw10602@​nifty.com> said​:

  > On Thu, 2 Nov 2006 18​:24​:54 +0100, Dominic Dunlop wrote

  > diff -urN Parse-RecDescent-1.94/lib/Parse/RecDescent.pm Parse-RecDescent-new/lib/Parse/RecDescent.pm
  > --- Parse-RecDescent-1.94/lib/Parse/RecDescent.pm Wed Apr 09 17​:29​:37 2003
  > +++ Parse-RecDescent-new/lib/Parse/RecDescent.pm Fri Nov 03 12​:27​:33 2006

I've also put the patch on CPAN as

  file​: $CPAN/authors/id/A/AN/ANDK/patches/Parse-RecDescent-1.94-SADAHIRO-01.patch.gz
  size​: 1800 bytes
  md5​: a44d52e2cb43b4433f4fe39b0d47c83c

This is not meant to steal the show from you; it is just a test bed for
the upcoming CPAN.pm release. I hope that's OK with you.

--
andreas


p5pRT commented Nov 3, 2006

From BQW10602@nifty.com

On Thu, 02 Nov 2006 22​:55​:23 +0900, SADAHIRO Tomoyuki wrote

On Thu, 2 Nov 2006 09​:23​:53 +0100, demerphq wrote

This rings a bell for me; didn't we have an issue related to this
recently where the encoding wasn't cascaded into an eval?

Note the backtrace: the error is NOT being emitted by the regex
engine; this is happening in the lexing of the eval.

I think a chunk (see below) at the bottom of S_regatom()
seems problematic.

How to crash perl in this case​:

1. When SIZE_ONLY is true,
1.1. reg_node doesn't set ret's type;
1.2. PL_regkind[OP(ret)] at line 6381 is never EXACT;
1.3. thus growing RExC_size at 6402 is never performed.
2. When SIZE_ONLY is false,
2.1. sv_recode_to_utf8 converts the string from ä to U+FFFD;
2.2. Copy at 6398 overruns STRING(ret) that is shorter than newlen.
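
The mismatch between the two passes in the steps above can be modelled with a small C sketch. It is a simplified model and not the regcomp.c logic itself: struct node_plan, plan_exact_node() and overrun_bytes() are hypothetical names, and sizes are counted in plain bytes rather than in regnode units as STR_SZ() does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one string node in a two-pass compiler. */
struct node_plan {
    size_t sized_bytes;    /* space the sizing pass reserved for the string */
    size_t emitted_bytes;  /* bytes the emit pass copies into that space */
};

/* oldlen: length of the string before recoding.
 * newlen: length after recoding (for example 1 byte growing to 3).
 * sizing_pass_sees_exact: non-zero if the sizing pass recognises the node
 * as EXACT and therefore accounts for the growth. */
struct node_plan plan_exact_node(size_t oldlen, size_t newlen,
                                 int sizing_pass_sees_exact)
{
    struct node_plan p;

    assert(newlen >= oldlen);
    /* Sizing pass: the growth is only added when the node type is known. */
    p.sized_bytes = sizing_pass_sees_exact ? newlen : oldlen;
    /* Emit pass: the recoded string is always copied in full. */
    p.emitted_bytes = newlen;
    return p;
}

/* Number of bytes written past the reserved space, or 0 if it fits. */
size_t overrun_bytes(struct node_plan p)
{
    return p.emitted_bytes > p.sized_bytes
         ? p.emitted_bytes - p.sized_bytes
         : 0;
}
```

In this model, a one-byte string that recodes to three bytes overruns by two bytes when the sizing pass never sees the node as EXACT, and by zero bytes when it does. That is the kind of heap damage that free() later reports.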

regcomp.c, 6379-6404
79: /* If the encoding pragma is in effect recode the text of
80:  * any EXACT-kind nodes. */
81: if (ret && PL_encoding && PL_regkind[OP(ret)] == EXACT) {
82:     const STRLEN oldlen = STR_LEN(ret);
83:     SV * const sv = sv_2mortal(newSVpvn(STRING(ret), oldlen));
84:
85:     if (RExC_utf8)
86:         SvUTF8_on(sv);
87:     if (sv_utf8_downgrade(sv, TRUE)) {
88:         const char * const s = sv_recode_to_utf8(sv, PL_encoding);
89:         const STRLEN newlen = SvCUR(sv);
90:
91:         if (SvUTF8(sv))
92:             RExC_utf8 = 1;
93:         if (!SIZE_ONLY) {
94:             GET_RE_DEBUG_FLAGS_DECL;
95:             DEBUG_COMPILE_r(PerlIO_printf(Perl_debug_log, "recode %*s to %*s\n",
96:                                           (int)oldlen, STRING(ret),
97:                                           (int)newlen, s));
98:             Copy(s, STRING(ret), newlen, char);
99:             STR_LEN(ret) += newlen - oldlen;
00:             RExC_emit += STR_SZ(newlen) - STR_SZ(oldlen);
01:         } else
02:             RExC_size += STR_SZ(newlen) - STR_SZ(oldlen);
03:     }
04: }

When the above chunk is removed, the following tests in ext/Encode/t/encoding.t
start to fail (cf. http://public.activestate.com/cgi-bin/perlbrowse/12864):

print "not " unless "\x{3AF}" =~ /\xDF/;
print "ok 17\n";

print "not " unless "\xDF" =~ /\xDF/;
print "ok 18\n";

Perhaps it is necessary to interpret \xHH in regex as encoded
in the encoding specified by PL_encoding, then call sv_recode_to_utf8
for \xHH, and get the corresponding Unicode character.
(for example \xDF to U+03AF)

Note​: real literals like /Gebäude/ are converted to unicode
by calling sv_recode_to_utf8() from scan_const() in toke.c.
But metacharacters like /\xHH/ are skipped in scan_const(),
and parsed by regatom() and regclass() in regcomp.c.
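
As a rough illustration of recoding only the escape, here is a C sketch that maps the byte named by \xHH to a code point. It assumes the encoding in effect is ISO-8859-7 (Greek), which the thread does not state but which is consistent with \xDF corresponding to U+03AF. iso8859_7_to_unicode() and NOT_COVERED are hypothetical names, and only ASCII and the two contiguous letter ranges are handled.

```c
#include <assert.h>
#include <stdint.h>

#define NOT_COVERED 0xFFFFFFFFu  /* byte outside the ranges this sketch handles */

/* Hypothetical helper: interpret one byte as ISO-8859-7 and return the
 * Unicode code point. Partial on purpose: ASCII plus 0xC1-0xD1 and
 * 0xD3-0xFE; every other byte yields NOT_COVERED. */
uint32_t iso8859_7_to_unicode(unsigned char byte)
{
    if (byte < 0x80)
        return byte;                              /* ASCII is unchanged */
    if (byte >= 0xC1 && byte <= 0xD1)
        return 0x0391u + (uint32_t)(byte - 0xC1); /* U+0391 .. U+03A1 */
    if (byte >= 0xD3 && byte <= 0xFE)
        return 0x03A3u + (uint32_t)(byte - 0xD3); /* U+03A3 .. U+03CE */
    return NOT_COVERED;
}
```

Under that assumption, the byte 0xDF maps to U+03AF, which is why "\x{3AF}" =~ /\xDF/ is expected to match once the escape is recoded.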

P.S. Although there is no fix yet, a test suite for this report
is attached to this mail as encoding-test.gz

Regards,
SADAHIRO Tomoyuki


p5pRT commented Nov 3, 2006


p5pRT commented Nov 3, 2006

From BQW10602@nifty.com

On Fri, 03 Nov 2006 06​:45​:04 +0100, Andreas J. Koenig wrote

(TheDamian added to CC)

On Fri, 03 Nov 2006 12​:35​:01 +0900, SADAHIRO Tomoyuki <bqw10602@​nifty.com> said​:

On Thu, 2 Nov 2006 18​:24​:54 +0100, Dominic Dunlop wrote

diff -urN Parse-RecDescent-1.94/lib/Parse/RecDescent.pm Parse-RecDescent-new/lib/Parse/RecDescent.pm
--- Parse-RecDescent-1.94/lib/Parse/RecDescent.pm Wed Apr 09 17​:29​:37 2003
+++ Parse-RecDescent-new/lib/Parse/RecDescent.pm Fri Nov 03 12​:27​:33 2006

I've also put the patch on CPAN as

file​: $CPAN/authors/id/A/AN/ANDK/patches/Parse-RecDescent-1.94-SADAHIRO-01.patch.gz
size​: 1800 bytes
md5​: a44d52e2cb43b4433f4fe39b0d47c83c

This is not meant to steal the show from you; it is just a test bed for
the upcoming CPAN.pm release. I hope that's OK with you.

Thank you. I'm wondering how this module is maintained currently.

Regards,
SADAHIRO Tomoyuki


p5pRT commented Nov 4, 2006

From BQW10602@nifty.com

On Fri, 03 Nov 2006 16​:23​:13 +0900, SADAHIRO Tomoyuki wrote

regcomp.c, 6379-6404

When the above chunk is removed, following tests in ext/Encode/t/encoding.t
come to fail (cf. http​://public.activestate.com/cgi-bin/perlbrowse/12864 )​:

Perhaps it is necessary to interpret \xHH in regex as encoded
in the encoding specified by PL_encoding, then call sv_recode_to_utf8
for \xHH, and get the corresponding Unicode character.
(for example \xDF to U+03AF)

Here is a patch (encoding.patch.gz).

- if PL_encoding, regatom() recodes only octal and hexadecimal escapes.
  This avoids double recoding of raw literal characters.
- regclass() also recodes escapes according to PL_encoding.
  Escapes in a character class are now aware of the current encoding.
- function S_reg_recode() in regcomp.c is defined.
- new test files, t/uni/greek.t and t/uni/latin2.t, are added.

Though these changes concern the functionality of encoding.pm
(that lives a dual life), they can't be backported without changes
against the core. Hence all changes are placed in p5p's space.

Regards,
SADAHIRO Tomoyuki


p5pRT commented Nov 4, 2006


p5pRT commented Nov 4, 2006

From BQW10602@nifty.com

On Sat, 04 Nov 2006 21​:02​:10 +0900, SADAHIRO Tomoyuki wrote

Here is a patch (encoding.patch.gz).

- if PL_encoding, regatom() recodes only octal and hexadecimal escapes.
This avoids double recoding of raw literal characters.
- regclass() also recodes escapes according to PL_encoding.
Now escapes in a character class get aware of the current encoding.
- function S_reg_recode() in regcomp.c is defined.
- new test files, t/uni/greek.t and t/uni/latin2.t, are added.

Though these changes concern the functionality of encoding.pm
(that lives a dual life), they can't be backported without changes
against the core. Hence all changes are placed in p5p's space.

The patch has now been updated as encoding2.gz.

- Test suites in the above patch emit messages including non-ASCII
  characters, which might be troublesome in some environments.

Therefore the only difference is in the messages from the test suites.

Regards,
SADAHIRO Tomoyuki


p5pRT commented Nov 4, 2006

From BQW10602@nifty.com

encoding2.gz


p5pRT commented Nov 4, 2006

From @demerphq

On 11/4/06, SADAHIRO Tomoyuki <bqw10602@​nifty.com> wrote​:

On Sat, 04 Nov 2006 21​:02​:10 +0900, SADAHIRO Tomoyuki wrote

Here is a patch (encoding.patch.gz).

- if PL_encoding, regatom() recodes only octal and hexadecimal escapes.
This avoids double recoding of raw literal characters.
- regclass() also recodes escapes according to PL_encoding.
Now escapes in a character class get aware of the current encoding.
- function S_reg_recode() in regcomp.c is defined.
- new test files, t/uni/greek.t and t/uni/latin2.t, are added.

Though these changes concern the functionality of encoding.pm
(that lives a dual life), they can't be backported without changes
against the core. Hence all changes are placed in p5p's space.

Then the patch is renewed, encoding2.gz

- Test suites in the above patch emit messages including non-ASCII
characters, but that might be troublesome on some environments.

Therefore the difference is only in messages from test suites.

Sadahiro San++

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"


p5pRT commented Nov 4, 2006

From @Tux

On Sat, 04 Nov 2006 21​:53​:50 +0900, SADAHIRO Tomoyuki <bqw10602@​nifty.com>
wrote​:

On Sat, 04 Nov 2006 21​:02​:10 +0900, SADAHIRO Tomoyuki wrote

Here is a patch (encoding.patch.gz).

- if PL_encoding, regatom() recodes only octal and hexadecimal escapes.
This avoids double recoding of raw literal characters.
- regclass() also recodes escapes according to PL_encoding.
Now escapes in a character class get aware of the current encoding.
- function S_reg_recode() in regcomp.c is defined.
- new test files, t/uni/greek.t and t/uni/latin2.t, are added.

Though these changes concern the functionality of encoding.pm
(that lives a dual life), they can't be backported without changes
against the core. Hence all changes are placed in p5p's space.

Then the patch is renewed, encoding2.gz

All tests successful (1 subtest UNEXPECTEDLY SUCCEEDED), 39 tests and 340 subtests skipped.
Thanks, applied as Change #29204

- Test suites in the above patch emit messages including non-ASCII
characters, but that might be troublesome on some environments.

Therefore the difference is only in messages from test suites.

--
H.Merijn Brand Amsterdam Perl Mongers (http​://amsterdam.pm.org/)
using & porting perl 5.6.2, 5.8.x, 5.9.x on HP-UX 10.20, 11.00, 11.11,
& 11.23, SuSE 10.0 & 10.1, AIX 4.3 & 5.2, and Cygwin. http​://qa.perl.org
http​://mirrors.develooper.com/hpux/ http​://www.test-smoke.org
  http​://www.goldmark.org/jeff/stupid-disclaimers/


p5pRT commented Nov 6, 2006

@rgs - Status changed from 'open' to 'resolved'
