
Subject: crash with unicode characters in regex comment
Date: Wed, 1 Nov 2006 21:38:22 +0100
To: perlbug [...] perl.org
From: Hermann Schwarting <h.f.s. [...] web.de>
This is a bug report for perl from h.f.s.@web.de,
generated with the help of perlbug 1.35 running under perl v5.8.8.

-----------------------------------------------------------------
Hi,

perl crashes when I use an umlaut in a comment to a regex in a Parse::RecDescent grammar and use encoding 'utf8';. I wasn't able to find out whose fault it is, because the bug is not triggered when I remove either the use line, one line of the regex, or the 'ä' in the comment.

This is the smallest example I was able to create:

    #!/usr/bin/perl
    use encoding 'utf8';
    use Parse::RecDescent;
    my $parser = testParser();
    sub testParser {
        return new Parse::RecDescent( q(
            test : /\. # Gebäude
                   (Test)?
                   /x
        ));
    }

The version of Parse::RecDescent is 1.94 from CPAN.

The error message is:

    *** glibc detected *** double free or corruption (!prev): 0x08393180 ***

The stacktrace is:

    Program received signal SIGABRT, Aborted.
    [Switching to Thread -1209919264 (LWP 13199)]
    0xb7e78947 in raise () from /lib/tls/libc.so.6
    (gdb) bt
    #0 0xb7e78947 in raise () from /lib/tls/libc.so.6
    #1 0xb7e7a0c9 in abort () from /lib/tls/libc.so.6
    #2 0xb7eadfda in __fsetlocking () from /lib/tls/libc.so.6
    #3 0xb7eb589f in mallopt () from /lib/tls/libc.so.6
    #4 0xb7eb5942 in free () from /lib/tls/libc.so.6
    #5 0x0809b517 in Perl_pregfree ()
    #6 0x0808a188 in Perl_op_clear ()
    #7 0x0808ccac in Perl_op_free ()
    #8 0x0808cc6f in Perl_op_free ()
    #9 0x0808cc6f in Perl_op_free ()
    #10 0x080e8e82 in Perl_leave_scope ()
    #11 0x080e93ac in Perl_pop_scope ()
    #12 0x080eb4d1 in Perl_pp_leaveeval ()
    #13 0x080bc3b9 in Perl_runops_standard ()
    #14 0x08063bfd in perl_run ()
    #15 0x0805ffd1 in main ()

This was with my system perl version. Continue below...

-----------------------------------------------------------------
---
Flags:
    category=core
    severity=low
---
Site configuration information for perl v5.8.8:

Configured by Debian Project at Sun Aug 6 15:47:28 UTC 2006.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.15-1-686, archname=i486-linux-gnu-thread-multi
    uname='linux ulises 2.6.15-1-686 #2 mon mar 6 15:27:08 utc 2006 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.8 -Dsitearch=/usr/local/lib/perl/5.8.8 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.8 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.1.2 20060729 (prerelease) (Debian 4.1.1-10)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.3.6.so, so=so, useshrplib=true, libperl=libperl.so.5.8.8
    gnulibc_version='2.3.6'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:

---
@INC for perl v5.8.8:
    /etc/perl
    /usr/local/lib/perl/5.8.8
    /usr/local/share/perl/5.8.8
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.8
    /usr/share/perl/5.8
    /usr/local/lib/site_perl
    /usr/local/share/perl/5.8.7
    /usr/local/share/perl/5.8.4
    .

---
Environment for perl v5.8.8:
    HOME=/home-stueck/hermi
    LANG=de_DE.UTF-8
    LANGUAGE=de_DE:de:en_GB:en
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games:/home-stueck/hermi/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

-----------------------------------------------------------------

I locally compiled perl v5.9.4 and was able to reproduce the problem with the same sample program. The stacktrace:

    (gdb) set args /tmp/minimal.pl
    (gdb) run
    Starting program: /tmp/perl-5.9.4/bin/perl5.9.4 /tmp/minimal.pl

    Program received signal SIGSEGV, Segmentation fault.
    0xb7d9b50f in mallopt () from /lib/tls/libc.so.6
    (gdb) bt
    #0 0xb7d9b50f in mallopt () from /lib/tls/libc.so.6
    #1 0xb7d9b942 in free () from /lib/tls/libc.so.6
    #2 0x0808dc58 in Perl_safesysfree (where=0x8445350) at util.c:250
    #3 0x080f4ecf in Perl_sv_clear (sv=0x8459ac8) at sv.c:5178
    #4 0x080f5131 in Perl_sv_free2 (sv=0x8459ac8) at sv.c:5281
    #5 0x08137ee9 in Perl_leave_scope (base=228) at scope.c:742
    #6 0x08134165 in Perl_pop_scope () at scope.c:99
    #7 0x08220b83 in Perl_yyparse () at perly.c:717
    #8 0x0814e825 in S_doeval (gimme=0, startop=0x0, outside=0x83a8a70, seq=1415) at pp_ctl.c:2937
    #9 0x08152e15 in Perl_pp_entereval () at pp_ctl.c:3501
    #10 0x0808d20d in Perl_runops_debug () at dump.c:1907
    #11 0x080bb726 in S_run_body (oldscope=1) at perl.c:2391
    #12 0x080bad02 in perl_run (my_perl=0x8283008) at perl.c:2311
    #13 0x0805e8e1 in main (argc=2, argv=0xbfa38964, env=0xbfa38970) at perlmain.c:113

Regards,
Hermann Schwarting

-----------------------------------------------------------------
---
Site configuration information for perl 5.9.4:

Configured by hermi at Wed Nov 1 18:16:47 CET 2006.

Summary of my perl5 (revision 5 version 9 subversion 4) configuration:
  Platform:
    osname=linux, osvers=2.6.17, archname=i686-linux
    uname='linux stueck 2.6.17-custom.1 #1 preempt tue jun 27 22:09:46 cest 2006 i686 gnulinux '
    config_args=''
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-g',
    cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include'
    ccversion='', gccversion='4.1.2 20060901 (prerelease) (Debian 4.1.1-13)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.3.6.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.3.6'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:

---
@INC for perl 5.9.4:
    /usr/share/perl5
    /usr/local/share/perl/5.8.8
    /tmp/perl-5.9.4/lib/5.9.4/i686-linux
    /tmp/perl-5.9.4/lib/5.9.4
    /tmp/perl-5.9.4/lib/site_perl/5.9.4/i686-linux
    /tmp/perl-5.9.4/lib/site_perl/5.9.4
    .

---
Environment for perl 5.9.4:
    HOME=/home-stueck/hermi
    LANG=de_DE.UTF-8
    LANGUAGE=de_DE:de:en_GB:en
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games:/home-stueck/hermi/bin
    PERL5LIB=/usr/share/perl5:/usr/local/share/perl/5.8.8
    PERL_BADLANG (unset)
    SHELL=/bin/bash

-----------------------------------------------------------------
Subject: crash with unicode characters in regex in eval
From: Hermann Schwarting <h.f.s. [...] web.de>
I investigated a bit further. With both perl 5.8.8 and 5.9.4 (configuration as above) this works:

    perl -e 'eval "m/ää/;"'

while this doesn't:

    perl -u -e 'eval "m/ää/;"'

I guess the problems are related. The backtrace is:

    (gdb) set args -u -e 'eval "m/ää/;"'
    (gdb) run
    Starting program: /tmp/perl-5.9.4/bin/perl5.9.4 -u -e 'eval "m/ää/;"'

    Program received signal SIGABRT, Aborted.
    0xb7dccd51 in kill () from /lib/tls/libc.so.6
    (gdb) bt
    #0 0xb7dccd51 in kill () from /lib/tls/libc.so.6
    #1 0x080bf744 in Perl_my_unexec () at perl.c:3440
    #2 0x080bab3f in S_parse_body (env=0x0, xsinit=0x805e909 <xs_init>) at perl.c:2260
    #3 0x080b8de0 in perl_parse (my_perl=0x8283008, xsinit=0x805e909 <xs_init>, argc=4, argv=0xbfc132b4, env=0x0) at perl.c:1607
    #4 0x0805e8cb in main (argc=4, argv=0xbfc132b4, env=0xbfc132c8) at perlmain.c:111
From: Hermann Schwarting <h.f.s. [...] web.de>
Whoops, I must have things seriously mixed up to think that -u had anything to do with Unicode. What I meant to say is

    perl -e 'eval "m/ää/;"'

works while

    perl -e 'use encoding "utf8"; eval "m/ää/;"'

doesn't. Backtrace:

    (gdb) set args -e 'use encoding "utf8"; eval "m/ää/;"'
    (gdb) run
    Starting program: /tmp/perl-5.9.4/bin/perl5.9.4 -e 'use encoding "utf8"; eval "m/ää/;"'
    *** glibc detected *** free(): invalid next size (fast): 0x082f1b30 ***

    Program received signal SIGABRT, Aborted.
    0xb7dcb947 in raise () from /lib/tls/libc.so.6
    (gdb) bt
    #0 0xb7dcb947 in raise () from /lib/tls/libc.so.6
    #1 0xb7dcd0c9 in abort () from /lib/tls/libc.so.6
    #2 0xb7e00fda in __fsetlocking () from /lib/tls/libc.so.6
    #3 0xb7e0889f in mallopt () from /lib/tls/libc.so.6
    #4 0xb7e08942 in free () from /lib/tls/libc.so.6
    #5 0x0808dc58 in Perl_safesysfree (where=0x82f1b30) at util.c:250
    #6 0x0823d867 in Perl_pregfree (r=0x828b638) at regcomp.c:6707
    #7 0x0805f246 in Perl_op_clear (o=0x828b5d8) at op.c:471
    #8 0x0805eed3 in Perl_op_free (o=0x828b5d8) at op.c:329
    #9 0x0805ee88 in Perl_op_free (o=0x829ebb8) at op.c:318
    #10 0x0805ee88 in Perl_op_free (o=0x829ec10) at op.c:318
    #11 0x08137fcc in Perl_leave_scope (base=3) at scope.c:751
    #12 0x08134165 in Perl_pop_scope () at scope.c:99
    #13 0x0815386d in Perl_pp_leaveeval () at pp_ctl.c:3570
    #14 0x0808d20d in Perl_runops_debug () at dump.c:1907
    #15 0x080bb726 in S_run_body (oldscope=1) at perl.c:2391
    #16 0x080bad02 in perl_run (my_perl=0x8283008) at perl.c:2311
    #17 0x0805e8e1 in main (argc=3, argv=0xbfd5d444, env=0xbfd5d454) at perlmain.c:113

Please excuse the noise.
Subject: Re: [perl #40641] crash with unicode characters in regex comment
Date: Thu, 02 Nov 2006 03:45:34 +0100
To: perl5-porters [...] perl.org
From: andreas.koenig.gmwojprw [...] franz.ak.mind.de (Andreas J. Koenig)
>>>>> On Wed, 01 Nov 2006 12:40:40 -0800, Hermann Schwarting (via RT) <perlbug-followup@perl.org> said:

 > use encoding 'utf8';

There are known bugs in C< use encoding 'utf8'; > which usually go away when you do a C< use utf8; > instead.

That said, I cannot reproduce your bug with any of 10 different bleedperls. As my machine is rather similar to yours I append my config below. The only outstanding difference is that I have use64bitint=define.

--
andreas

Summary of my perl5 (revision 5 version 9 subversion 5) configuration:
  Platform:
    osname=linux, osvers=2.6.17-2-k7, archname=i686-linux-64int
    uname='linux k75 2.6.17-2-k7 #1 smp thu aug 31 13:27:53 utc 2006 i686 gnulinux '
    config_args='-Dprefix=/home/src/perl/repoperls/installed-perls/perl/pCbaqkQ/perl-5.8.0@29183 -Dinstallusrbinperl=n -Uversiononly -Doptimize=-g -des -Duse64bitint -Dusedevel'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-g',
    cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.1.2 20060901 (prerelease) (Debian 4.1.1-13)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.3.6.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.3.6'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Characteristics of this binary (from libperl):
  Compile-time options: DEBUGGING PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP USE_64_BIT_INT USE_LARGE_FILES USE_PERLIO
  Locally applied patches:
    DEVEL
    patchaperlup: --branch='perl' --upto='29183' --start='17639'
  Built under linux
  Compiled at Nov 1 2006 20:08:58
  @INC:
    /home/src/perl/repoperls/installed-perls/perl/pCbaqkQ/perl-5.8.0@29183/lib/5.9.5/i686-linux-64int
    /home/src/perl/repoperls/installed-perls/perl/pCbaqkQ/perl-5.8.0@29183/lib/5.9.5
    /home/src/perl/repoperls/installed-perls/perl/pCbaqkQ/perl-5.8.0@29183/lib/site_perl/5.9.5/i686-linux-64int
    /home/src/perl/repoperls/installed-perls/perl/pCbaqkQ/perl-5.8.0@29183/lib/site_perl/5.9.5
    .
CC: perl5-porters [...] perl.org
Subject: Re: [perl #40641] crash with unicode characters in regex comment
Date: Thu, 2 Nov 2006 09:23:53 +0100
To: "Andreas J. Koenig" <andreas.koenig.gmwojprw [...] franz.ak.mind.de>
From: demerphq <demerphq [...] gmail.com>
On 11/2/06, Andreas J. Koenig <andreas.koenig.gmwojprw@franz.ak.mind.de> wrote:
> >>>>> On Wed, 01 Nov 2006 12:40:40 -0800, Hermann Schwarting (via RT) <perlbug-followup@perl.org> said:
>
> > use encoding 'utf8';
>
> There are known bugs in C< use encoding 'utf8'; > which usually go
> away when you do a C< use utf8; > instead.
>
> That said, I cannot reproduce your bug with any of 10 different
> bleedperls. As my machine is rather similar to yours I append my
> config below. The only outstanding difference is that I have
> use64bitint=define.

This rings a bell for me: didn't we have an issue related to this recently where the encoding wasn't cascaded into an eval?

Note the backtrace: the error is NOT being emitted by the regex engine; this is happening in the lexing of the eval.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
Subject: Re: [perl #40641] crash with unicode characters in regex comment
Date: Thu, 02 Nov 2006 21:57:49 +0900
To: perl5-porters [...] perl.org
From: SADAHIRO Tomoyuki <bqw10602 [...] nifty.com>
On Thu, 02 Nov 2006 03:45:34 +0100, (Andreas J. Koenig) wrote
> >>>>> On Wed, 01 Nov 2006 12:40:40 -0800, Hermann Schwarting (via RT) said:
>
> > use encoding 'utf8';
>
> There are known bugs in C< use encoding 'utf8'; > which usually go
> away when you do a C< use utf8; > instead.
>
> That said, I cannot reproduce your bug with any of 10 different
> bleedperls. As my machine is rather similar to yours I append my
> config below. The only outstanding difference is that I have
> use64bitint=define.

The following causes crash on perl 5.8.1 and later (including perl-current),

    #!perl
    use encoding 'utf8';
    my $re = qr/a # Gebääääääääääääude
               /;
    __END__

... and I have often (but not always) encountered such a message:

    Free to wrong pool 272840 not 34 during global destruction.

Regards,
SADAHIRO Tomoyuki
Subject: Re: [perl #40641] crash with unicode characters in regex comment
Date: Thu, 02 Nov 2006 22:55:23 +0900
To: perl5-porters [...] perl.org
From: SADAHIRO Tomoyuki <bqw10602 [...] nifty.com>
On Thu, 2 Nov 2006 09:23:53 +0100, demerphq wrote
>
> This rings a bell for me: didn't we have an issue related to this
> recently where the encoding wasn't cascaded into an eval?
>
> Note the backtrace: the error is NOT being emitted by the regex
> engine; this is happening in the lexing of the eval.

I think a chunk (see below) at the bottom of S_regatom() seems problematic.

    6381:     if (ret && PL_encoding && PL_regkind[OP(ret)] == EXACT) {
    6382:         const STRLEN oldlen = STR_LEN(ret);
    6383:         SV * const sv = sv_2mortal(newSVpvn(STRING(ret), oldlen));
    6384:
    6385:         if (RExC_utf8)
    6386:             SvUTF8_on(sv);
    6387:         if (sv_utf8_downgrade(sv, TRUE)) {
    6388:             const char * const s = sv_recode_to_utf8(sv, PL_encoding);
    6389:             const STRLEN newlen = SvCUR(sv);

ä encoded in 'utf8' is downgraded by sv_utf8_downgrade() at line 6387, and then sv_recode_to_utf8() recodes the downgraded ä in the native encoding (ISO-8859-1), but PL_encoding expects that the input is in 'utf8'. As this ISO-8859-1 string is of course ill-formed as 'utf8', the result in s at 6388 is REPLACEMENT CHARACTER (U+FFFD).

I don't see what this chunk is for.

Regards,
SADAHIRO Tomoyuki
Subject: Re: [perl #40641] crash with unicode characters in regex comment
Date: Thu, 02 Nov 2006 23:14:09 +0900
To: perl5-porters [...] perl.org
From: SADAHIRO Tomoyuki <bqw10602 [...] nifty.com>
On Thu, 02 Nov 2006 22:55:23 +0900, SADAHIRO Tomoyuki wrote
> I think a chunk (see below) at the bottom of S_regatom()
> seems problematic.
>
> 6381:     if (ret && PL_encoding && PL_regkind[OP(ret)] == EXACT) {
> 6382:         const STRLEN oldlen = STR_LEN(ret);
> 6383:         SV * const sv = sv_2mortal(newSVpvn(STRING(ret), oldlen));
> 6384:
> 6385:         if (RExC_utf8)
> 6386:             SvUTF8_on(sv);
> 6387:         if (sv_utf8_downgrade(sv, TRUE)) {
> 6388:             const char * const s = sv_recode_to_utf8(sv, PL_encoding);
> 6389:             const STRLEN newlen = SvCUR(sv);
>
> ä encoded in 'utf8' is downgraded by sv_utf8_downgrade() at line 6387,
> and then sv_recode_to_utf8() recodes the downgraded ä in the native
> encoding (ISO-8859-1), but PL_encoding expects the input must be
> in 'utf8'.
> As this ISO-8859-1 string is of course ill-formed as 'utf8',
> the result in s at 6388 is REPLACEMENT CHARACTER (U+FFFD).
>
> I don't see what this chunk is for.

I think the example below should clarify the circumstances.

    #!perl
    use charnames qw(:full);
    $a = "\N{LATIN SMALL LETTER A WITH DIAERESIS}";
    use encoding 'utf8';
    utf8::downgrade($a);
    utf8::upgrade($a);
    use Devel::Peek;
    Dump($a); # now U+FFFD!
    __END__

While sv_utf8_upgrade() is encoding-aware, sv_utf8_downgrade() isn't. Therefore the downgrade is not the inverse of the upgrade under the encoding pragma. I believe any sv_utf8_downgrade under the encoding pragma makes no sense.

Regards,
SADAHIRO Tomoyuki
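To make the asymmetry described above concrete, here is a minimal standalone C sketch. It does not use the perl API; the three function names are hypothetical, and the sketch assumes that the "encoding-aware" upgrade amounts to decoding the bytes as UTF-8 and substituting U+FFFD for anything ill-formed, which is what the message reports observing. Only 1- and 2-byte UTF-8 forms are handled, for brevity.

```c
#include <stddef.h>

/* Downgrade: UTF-8 -> one byte per character (Latin-1).
 * Returns the number of bytes written to out, or 0 if the input contains
 * anything other than ASCII and two-byte sequences for U+0080..U+00FF. */
size_t downgrade_to_latin1(const unsigned char *in, size_t len,
                           unsigned char *out)
{
    size_t i = 0, n = 0;
    while (i < len) {
        if (in[i] < 0x80) {
            out[n++] = in[i++];
        } else if ((in[i] == 0xC2 || in[i] == 0xC3) && i + 1 < len
                   && (in[i + 1] & 0xC0) == 0x80) {
            out[n++] = (unsigned char)(((in[i] & 0x03) << 6)
                                       | (in[i + 1] & 0x3F));
            i += 2;
        } else {
            return 0;
        }
    }
    return n;
}

/* Native upgrade: Latin-1 -> UTF-8. out must hold 2 * len bytes.
 * This is the exact inverse of downgrade_to_latin1(). */
size_t upgrade_from_latin1(const unsigned char *in, size_t len,
                           unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[n++] = in[i];
        } else {
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return n;
}

/* Assumed model of an upgrade under a declared encoding of UTF-8: the
 * bytes are decoded as UTF-8, and each byte that does not start a
 * well-formed 1- or 2-byte sequence becomes U+FFFD (EF BF BD).
 * out must hold 3 * len bytes. */
size_t upgrade_as_utf8(const unsigned char *in, size_t len,
                       unsigned char *out)
{
    size_t i = 0, n = 0;
    while (i < len) {
        if (in[i] < 0x80) {
            out[n++] = in[i++];
        } else if (in[i] >= 0xC2 && in[i] <= 0xDF && i + 1 < len
                   && (in[i + 1] & 0xC0) == 0x80) {
            out[n++] = in[i++];
            out[n++] = in[i++];
        } else {
            out[n++] = 0xEF;
            out[n++] = 0xBF;
            out[n++] = 0xBD;
            i++;
        }
    }
    return n;
}
```

Feeding the two bytes C3 A4 (ä) through downgrade_to_latin1() gives the single byte E4. upgrade_from_latin1() turns that back into C3 A4, while upgrade_as_utf8() turns it into the three bytes EF BF BD, so the string both changes meaning and grows.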
CC: perl5-porters [...] perl.org
Subject: Re: [perl #40641] crash with unicode characters in regex comment
Date: Thu, 2 Nov 2006 18:24:54 +0100
To: SADAHIRO Tomoyuki <bqw10602 [...] nifty.com>
From: Dominic Dunlop <shouldbedomo [...] mac.com>
On 2006–11–02, at 13:57, SADAHIRO Tomoyuki wrote:
> The following causes crash on perl 5.8.1 and later
> (including perl-current),
>
> #!perl
> use encoding 'utf8';
> my $re = qr/a # Gebääääääääääääude
>            /;
> __END__

That's interesting: the original bug report was using an extended regex, and your example does not, so they may not be tickling the same problem:

On 2006–11–01, at 21:40, Hermann Schwarting (via RT) wrote:
> test : /\. # Gebäude
>        (Test)?
>        /x

Firstly, with 32-bit debugging bleadperl@29183 on Mac OS X, Hermann's script does not crash for me whether the regex is extended or not:

    32-bit_perl-current$ ./perl -Ilib -I/usr/local/lib/perl5/site_perl/5.8.8/ -w
    use encoding 'utf8';
    use Parse::RecDescent;
    my $parser = testParser();
    sub testParser {
        return new Parse::RecDescent( q(
            test : /\. # Gebäude
                   (Test)?
                   /x
        ));
    }
    32-bit_perl-current$ ./perl -Ilib -I/usr/local/lib/perl5/site_perl/5.8.8/ -w
    use encoding 'utf8';
    use Parse::RecDescent;
    my $parser = testParser();
    sub testParser {
        return new Parse::RecDescent( q(
            test : /\. # Gebäude
                   (Test)?
                   /
        ));
    }
    32-bit_perl-current$

This may be due to architectural differences between Mac OS X on PPC and Linux on i486.

Secondly, making your regex extended stops perl misbehaving for me, bolstering my suspicion that you've uncovered a different bug from Hermann:

    32-bit_perl-current$ ./perl -w -Ilib
    use encoding 'utf8';
    my $re = qr/a # Gebääääääääääääude
               /x;
    32-bit_perl-current$ ./perl -w -Ilib
    use encoding 'utf8';
    my $re = qr/a # Gebääääääääääääude
               /;
    perl(6787) malloc: *** error for object 0x1306ad0: incorrect checksum for freed object - object was probably modified after being freed, break at szone_error to debug
    perl(6787) malloc: *** set a breakpoint in szone_error to debug
    32-bit_perl-current$ gdb ./perl
    GNU gdb 6.3.50-20050815 (Apple version gdb-573) (Fri Oct 20 15:54:33 GMT 2006)
    Copyright 2004 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB. Type "show warranty" for details.
    This GDB was configured as "powerpc-apple-darwin"...Reading symbols for shared libraries .. done

    (gdb) b szone_error
    Breakpoint 1 at 0x90114024
    (gdb) r -Ilib -w
    Starting program: /Volumes/Tottie/Other/src/Perlsmoke/32-bit_perl-current/perl -Ilib -w
    Reading symbols for shared libraries . done
    use encoding 'utf8';
    my $re = qr/a # Gebääääääääääääude
    /Reading symbols for shared libraries .. done
    Reading symbols for shared libraries . done
    /;

    Breakpoint 1, 0x90114024 in szone_error ()
    (gdb) where
    #0 0x90114024 in szone_error ()
    #1 0x90003530 in szone_malloc ()
    #2 0x90002e00 in malloc ()
    #3 0x0000a6bc in Perl_safesysmalloc (size=60) at util.c:92
    #4 0x001064b8 in Perl_pregcomp (my_perl=0x1800400, exp=0x1106020 "a # Gebääääääääääääude\n", ' ' <repeats 11 times>, xend=0x1106052 "", pm=0x115c7f0) at regcomp.c:4020
    #5 0x0001b0d0 in Perl_pmruntime (my_perl=0x1800400, o=0x115c7f0, expr=0x1106060, isreg=112 'p') at op.c:3228
    #6 0x0015a5b8 in Perl_yyparse (my_perl=0x1800400) at perly.y:730
    #7 0x00030828 in S_parse_body (my_perl=0x1800400, env=0x0, xsinit=0x24fc <xs_init>) at perl.c:2255
    #8 0x0002f3fc in perl_parse (my_perl=0x1800400, xsinit=0x24fc <xs_init>, argc=3, argv=0xbffff938, env=0x0) at perl.c:1616
    #9 0x00002434 in main (argc=3, argv=0xbffff938, env=0xbffff948) at perlmain.c:111
    (gdb) quit

I hope that the gdb output means something to somebody. Please let me know if you want me to poke around and tell you the value of any variables.

--
Dominic Dunlop
Subject: Re: [perl #40641] crash with unicode characters in regex comment
Date: Fri, 03 Nov 2006 12:35:01 +0900
To: perl5-porters [...] perl.org
From: SADAHIRO Tomoyuki <bqw10602 [...] nifty.com>
On Thu, 2 Nov 2006 18:24:54 +0100, Dominic Dunlop wrote
> On 2006–11–02, at 13:57, SADAHIRO Tomoyuki wrote:
>
> > The following causes crash on perl 5.8.1 and later
> > (including perl-current),
> >
> > #!perl
> > use encoding 'utf8';
> > my $re = qr/a # Gebääääääääääääude
> >            /;
> > __END__
>
> That's interesting: the original bug report was using an extended
> regex, and your example does not, so they may not be tickling the
> same problem:

> Secondly, making your regex extended stops perl misbehaving for me,
> bolstering my suspicion that you've uncovered a different bug from
> Hermann:

No, I don't think it is different. Actually Hermann's code causes sv_utf8_downgrade to be called from the bottom of S_regatom().

I think another bug in Parse::RecDescent is involved in this. Say, try this, where [c-a] in the comment would be invalid if it were part of the regex:

    #!perl
    use Parse::RecDescent;
    my $parser = testParser();
    sub testParser {
        return new Parse::RecDescent( q(
            test : /\. # [c-a]
                   (Test)?
                   /x
        ));
    }
    __END__

This causes weird messages:

    Variable "$errortext" is not available at Parse/RecDescent.pm line 2917.
    Variable "$errorprefix" is not available at Parse/RecDescent.pm line 2917.
    Use of uninitialized value $errorprefix in formline at Parse/RecDescent.pm line 2850.
    Use of uninitialized value $errortext in formline at Parse/RecDescent.pm line 2850.
    Use of uninitialized value $errortext in formline at Parse/RecDescent.pm line 2852.

The reason is that the invalid pattern in the comment is wrongly compiled as part of the regex, the error is caught by eval, but the *buggy* Parse::RecDescent::_warn() fails to report the warning messages.

Parse::RecDescent 1082-1091

    if (!eval "no strict;
               local \$SIG{__WARN__} = sub {0};
               '' =~ m$ldel$pattern$rdel" and $@)
    {
        Parse::RecDescent::_warn(3, "Token pattern \"m$ldel$pattern$rdel\"
                                     may not be a valid regular expression",
                                 $_[5]);
        $@ =~ s/ at \(eval.*/./;
        Parse::RecDescent::_hint($@);
    }

               '' =~ m$ldel$pattern$rdel" and $@)

If Parse::RecDescent::_warn worked properly, the warning would look like this:

    Token pattern "m/\. # [c-a]
        (Test)?
        /" may not be a valid regular expression

The modifiers are wrongly omitted from the evalled code, so the /x flag is lost and the comment is compiled as pattern text.

    diff -urN Parse-RecDescent-1.94/lib/Parse/RecDescent.pm Parse-RecDescent-new/lib/Parse/RecDescent.pm
    --- Parse-RecDescent-1.94/lib/Parse/RecDescent.pm Wed Apr 09 17:29:37 2003
    +++ Parse-RecDescent-new/lib/Parse/RecDescent.pm Fri Nov 03 12:27:33 2006
    @@ -1081,9 +1081,9 @@

     if (!eval "no strict;
                local \$SIG{__WARN__} = sub {0};
    -           '' =~ m$ldel$pattern$rdel" and $@)
    +           '' =~ m$ldel$pattern$rdel$mod" and $@)
     {
    -    Parse::RecDescent::_warn(3, "Token pattern \"m$ldel$pattern$rdel\"
    +    Parse::RecDescent::_warn(3, "Token pattern \"m$ldel$pattern$rdel$mod\"
                                      may not be a valid regular expression",
                                  $_[5]);
         $@ =~ s/ at \(eval.*/./;

Regards,
SADAHIRO Tomoyuki
CC: perl5-porters [...] perl.org, Damian Conway <damian [...] conway.org>
Subject: Re: [perl #40641] crash with unicode characters in regex comment
Date: Fri, 03 Nov 2006 06:45:04 +0100
To: SADAHIRO Tomoyuki <bqw10602 [...] nifty.com>
From: andreas.koenig.gmwojprw [...] franz.ak.mind.de (Andreas J. Koenig)
(TheDamian added to CC)

>>>>> On Fri, 03 Nov 2006 12:35:01 +0900, SADAHIRO Tomoyuki <bqw10602@nifty.com> said:

 > On Thu, 2 Nov 2006 18:24:54 +0100, Dominic Dunlop wrote

 > diff -urN Parse-RecDescent-1.94/lib/Parse/RecDescent.pm Parse-RecDescent-new/lib/Parse/RecDescent.pm
 > --- Parse-RecDescent-1.94/lib/Parse/RecDescent.pm Wed Apr 09 17:29:37 2003
 > +++ Parse-RecDescent-new/lib/Parse/RecDescent.pm Fri Nov 03 12:27:33 2006

I've also put the patch on CPAN as

    file: $CPAN/authors/id/A/AN/ANDK/patches/Parse-RecDescent-1.94-SADAHIRO-01.patch.gz
    size: 1800 bytes
    md5: a44d52e2cb43b4433f4fe39b0d47c83c

This is not meant to steal the show from you, just a test bed for the upcoming CPAN.pm release. I hope that's OK for you.

--
andreas
Subject: Re: [perl #40641] crash with unicode characters in regex comment
Date: Fri, 03 Nov 2006 16:23:13 +0900
To: perl5-porters [...] perl.org
From: SADAHIRO Tomoyuki <bqw10602 [...] nifty.com>
On Thu, 02 Nov 2006 22:55:23 +0900, SADAHIRO Tomoyuki wrote
> On Thu, 2 Nov 2006 09:23:53 +0100, demerphq wrote
> >
> > This rings a bell for me: didn't we have an issue related to this
> > recently where the encoding wasn't cascaded into an eval?
> >
> > Note the backtrace: the error is NOT being emitted by the regex
> > engine; this is happening in the lexing of the eval.
>
> I think a chunk (see below) at the bottom of S_regatom()
> seems problematic.

How to crash perl in this case:

1. When SIZE_ONLY is true,
   1.1. reg_node doesn't set ret's type;
   1.2. PL_regkind[OP(ret)] at line 6381 is never EXACT;
   1.3. thus growing RExC_size at 6402 is never performed.
2. When SIZE_ONLY is false,
   2.1. sv_recode_to_utf8 converts the string from ä to U+FFFD;
   2.2. Copy at 6398 overruns STRING(ret) that is shorter than newlen.

regcomp.c, 6379-6404

    79:     /* If the encoding pragma is in effect recode the text of
    80:      * any EXACT-kind nodes. */
    81:     if (ret && PL_encoding && PL_regkind[OP(ret)] == EXACT) {
    82:         const STRLEN oldlen = STR_LEN(ret);
    83:         SV * const sv = sv_2mortal(newSVpvn(STRING(ret), oldlen));
    84:
    85:         if (RExC_utf8)
    86:             SvUTF8_on(sv);
    87:         if (sv_utf8_downgrade(sv, TRUE)) {
    88:             const char * const s = sv_recode_to_utf8(sv, PL_encoding);
    89:             const STRLEN newlen = SvCUR(sv);
    90:
    91:             if (SvUTF8(sv))
    92:                 RExC_utf8 = 1;
    93:             if (!SIZE_ONLY) {
    94:                 GET_RE_DEBUG_FLAGS_DECL;
    95:                 DEBUG_COMPILE_r(PerlIO_printf(Perl_debug_log, "recode %*s to %*s\n",
    96:                                 (int)oldlen, STRING(ret),
    97:                                 (int)newlen, s));
    98:                 Copy(s, STRING(ret), newlen, char);
    99:                 STR_LEN(ret) += newlen - oldlen;
    00:                 RExC_emit += STR_SZ(newlen) - STR_SZ(oldlen);
    01:             } else
    02:                 RExC_size += STR_SZ(newlen) - STR_SZ(oldlen);
    03:         }
    04:     }

When the above chunk is removed, the following tests in ext/Encode/t/encoding.t start to fail (cf. http://public.activestate.com/cgi-bin/perlbrowse/12864 ):

    print "not " unless "\x{3AF}" =~ /\xDF/;
    print "ok 17\n";
    print "not " unless "\xDF" =~ /\xDF/;
    print "ok 18\n";

Perhaps it is necessary to interpret \xHH in a regex as encoded in the encoding specified by PL_encoding, then call sv_recode_to_utf8 for \xHH, and get the corresponding Unicode character (for example \xDF to U+03AF).

Note: real literals like /Gebäude/ are converted to Unicode by calling sv_recode_to_utf8() from scan_const() in toke.c. But metacharacters like /\xHH/ are skipped in scan_const(), and parsed by regatom() and regclass() in regcomp.c.

P.S. Though there is no fix yet, a test suite for this report is attached to this mail as encoding-test.gz

Regards,
SADAHIRO Tomoyuki
Download encoding-test.gz
application/octet-stream 454b

Message body not shown because it is not plain text.
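The crash sequence in the message above comes down to a two-pass compiler whose sizing pass and emit pass disagree about how long a string will be. The following C sketch illustrates that failure mode in isolation. It is not the regcomp.c code: the function names are hypothetical, the literal is assumed to contain only ASCII and two-byte UTF-8 characters, and the recode step is modelled as "every two-byte character becomes U+FFFD", matching what the thread reports for ä.

```c
#include <stddef.h>

/* Bytes needed once every two-byte character has been replaced by
 * U+FFFD (EF BF BD). Assumes lit holds only ASCII and two-byte UTF-8
 * sequences, which is enough for a literal such as "Gebäude". */
size_t len_after_recode(const unsigned char *lit, size_t len)
{
    size_t i = 0, n = 0;
    while (i < len) {
        if (lit[i] < 0x80) { n += 1; i += 1; }
        else               { n += 3; i += 2; }
    }
    return n;
}

/* Pass 1 (sizing). With account_for_recode == 0 this mimics the failure
 * mode from the thread: the growth caused by recoding is never counted. */
size_t size_pass(const unsigned char *lit, size_t len, int account_for_recode)
{
    return account_for_recode ? len_after_recode(lit, len) : len;
}

/* Pass 2 (emit). Writes the recoded literal into buf and returns the
 * number of bytes written, or (size_t)-1 if it would not fit in cap bytes.
 * The bounds check keeps this sketch safe; the thread reports that the
 * Copy at line 6398 overruns the destination instead. */
size_t emit_pass(unsigned char *buf, size_t cap,
                 const unsigned char *lit, size_t len)
{
    size_t i = 0, n = 0;
    while (i < len) {
        if (lit[i] < 0x80) {
            if (n + 1 > cap) return (size_t)-1;
            buf[n++] = lit[i++];
        } else {
            if (n + 3 > cap) return (size_t)-1;
            buf[n++] = 0xEF;
            buf[n++] = 0xBF;
            buf[n++] = 0xBD;
            i += 2;
        }
    }
    return n;
}
```

For the 8-byte UTF-8 literal "Gebäude", the uncorrected sizing pass reserves 8 bytes while the emit pass needs 9, so emit_pass() reports failure on an 8-byte buffer and succeeds on a 9-byte one. Each additional ä widens the gap by one byte, which fits the observation that a comment with many ä characters crashes more reliably.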

CC: perl5-porters [...] perl.org, Damian Conway <damian [...] conway.org>
Subject: Re: [perl #40641] crash with unicode characters in regex comment
Date: Fri, 03 Nov 2006 16:51:50 +0900
To: andreas.koenig.gmwojprw [...] franz.ak.mind.de (Andreas J. Koenig)
From: SADAHIRO Tomoyuki <bqw10602 [...] nifty.com>
On Fri, 03 Nov 2006 06:45:04 +0100, Andreas J. Koenig wrote
> (TheDamian added to CC)
>
> >>>>> On Fri, 03 Nov 2006 12:35:01 +0900, SADAHIRO Tomoyuki <bqw10602@nifty.com> said:
>
> > On Thu, 2 Nov 2006 18:24:54 +0100, Dominic Dunlop wrote
>
> > diff -urN Parse-RecDescent-1.94/lib/Parse/RecDescent.pm Parse-RecDescent-new/lib/Parse/RecDescent.pm
> > --- Parse-RecDescent-1.94/lib/Parse/RecDescent.pm Wed Apr 09 17:29:37 2003
> > +++ Parse-RecDescent-new/lib/Parse/RecDescent.pm Fri Nov 03 12:27:33 2006
>
> I've also put the patch on CPAN as
>
> file: $CPAN/authors/id/A/AN/ANDK/patches/Parse-RecDescent-1.94-SADAHIRO-01.patch.gz
> size: 1800 bytes
> md5: a44d52e2cb43b4433f4fe39b0d47c83c
>
> This is not meant to steal the show from you, just a test bed for the upcoming
> CPAN.pm release. I hope that's OK for you.

Thank you. I'm wondering how this module is maintained currently.

Regards,
SADAHIRO Tomoyuki
Subject: Re: [perl #40641] crash with unicode characters in regex comment
Date: Sat, 04 Nov 2006 21:02:10 +0900
To: perl5-porters [...] perl.org
From: SADAHIRO Tomoyuki <bqw10602 [...] nifty.com>
On Fri, 03 Nov 2006 16:23:13 +0900, SADAHIRO Tomoyuki wrote:
> regcomp.c, 6379-6404
>
> When the above chunk is removed, following tests in ext/Encode/t/encoding.t
> come to fail (cf. http://public.activestate.com/cgi-bin/perlbrowse/12864 ):
>
> Perhaps it is necessary to interpret \xHH in regex as encoded
> in the encoding specified by PL_encoding, then call sv_recode_to_utf8
> for \xHH, and get the corresponding Unicode character.
> (for example \xDF to U+03AF)

Here is a patch (encoding.patch.gz).

- if PL_encoding, regatom() recodes only octal and hexadecimal escapes.
  This avoids double recoding of raw literal characters.
- regclass() also recodes escapes according to PL_encoding.
  Now escapes in a character class get aware of the current encoding.
- function S_reg_recode() in regcomp.c is defined.
- new test files, t/uni/greek.t and t/uni/latin2.t, are added.

Though these changes concern the functionality of encoding.pm
(that lives a dual life), they can't be backported without changes
against the core. Hence all changes are placed in p5p's space.

Regards,
SADAHIRO Tomoyuki
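As a rough illustration of recoding only the escape, the following C sketch maps the value of a \xHH escape through the current encoding and then encodes the resulting code point as UTF-8. It is not the S_reg_recode() from the patch, whose body is not shown in this thread. `recode_escape` and `utf8_encode` are hypothetical helpers, and the toy table holds only the \xDF to U+03AF mapping mentioned earlier in the thread; every other high byte is assumed to fall back to U+FFFD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for "interpret \xHH in the current encoding".
 * Only the single mapping quoted in the thread is included. */
static uint32_t recode_escape(unsigned char byte)
{
    if (byte < 0x80)
        return byte;      /* ASCII maps to itself */
    if (byte == 0xDF)
        return 0x03AF;    /* example from the thread: \xDF to U+03AF */
    return 0xFFFD;        /* assumption: anything else is unmapped */
}

/* Encode one code point as UTF-8 and return the number of bytes written.
 * Surrogates are not rejected in this sketch. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Because the byte length can change here (one escape byte becomes two or three UTF-8 bytes), any sizing pass has to use the recoded length and not the length of the escape's raw value.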
Download encoding.patch.gz
application/octet-stream 4.6k

Subject: Re: [perl #40641] crash with unicode characters in regex comment
Date: Sat, 04 Nov 2006 21:53:50 +0900
To: perl5-porters [...] perl.org
From: SADAHIRO Tomoyuki <bqw10602 [...] nifty.com>
On Sat, 04 Nov 2006 21:02:10 +0900, SADAHIRO Tomoyuki wrote:
> Here is a patch (encoding.patch.gz).
>
> - if PL_encoding, regatom() recodes only octal and hexadecimal escapes.
>   This avoids double recoding of raw literal characters.
> - regclass() also recodes escapes according to PL_encoding.
>   Now escapes in a character class get aware of the current encoding.
> - function S_reg_recode() in regcomp.c is defined.
> - new test files, t/uni/greek.t and t/uni/latin2.t, are added.
>
> Though these changes concern the functionality of encoding.pm
> (that lives a dual life), they can't be backported without changes
> against the core. Hence all changes are placed in p5p's space.

Then the patch is renewed, encoding2.gz

- Test suites in the above patch emit messages including non-ASCII
  characters, but that might be troublesome on some environments.

Therefore the difference is only in messages from test suites.

Regards,
SADAHIRO Tomoyuki
Download encoding2.gz
application/octet-stream 4.7k

CC: perl5-porters [...] perl.org
Subject: Re: [perl #40641] crash with unicode characters in regex comment
Date: Sat, 4 Nov 2006 17:54:34 +0100
To: "SADAHIRO Tomoyuki" <bqw10602 [...] nifty.com>
From: demerphq <demerphq [...] gmail.com>
On 11/4/06, SADAHIRO Tomoyuki <bqw10602@nifty.com> wrote:
>
> On Sat, 04 Nov 2006 21:02:10 +0900, SADAHIRO Tomoyuki wrote
>
> > Here is a patch (encoding.patch.gz).
> >
> > - if PL_encoding, regatom() recodes only octal and hexadecimal escapes.
> >   This avoids double recoding of raw literal characters.
> > - regclass() also recodes escapes according to PL_encoding.
> >   Now escapes in a character class get aware of the current encoding.
> > - function S_reg_recode() in regcomp.c is defined.
> > - new test files, t/uni/greek.t and t/uni/latin2.t, are added.
> >
> > Though these changes concern the functionality of encoding.pm
> > (that lives a dual life), they can't be backported without changes
> > against the core. Hence all changes are placed in p5p's space.
>
> Then the patch is renewed, encoding2.gz
>
> - Test suites in the above patch emit messages including non-ASCII
>   characters, but that might be troublesome on some environments.
>
> Therefore the difference is only in messages from test suites.

Sadahiro San++

Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
Subject: Re: [perl #40641] crash with unicode characters in regex comment
Date: Sat, 4 Nov 2006 20:17:45 +0100
To: perl5-porters [...] perl.org
From: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>
On Sat, 04 Nov 2006 21:53:50 +0900, SADAHIRO Tomoyuki <bqw10602@nifty.com> wrote:
> On Sat, 04 Nov 2006 21:02:10 +0900, SADAHIRO Tomoyuki wrote
>
> > Here is a patch (encoding.patch.gz).
> >
> > - if PL_encoding, regatom() recodes only octal and hexadecimal escapes.
> >   This avoids double recoding of raw literal characters.
> > - regclass() also recodes escapes according to PL_encoding.
> >   Now escapes in a character class get aware of the current encoding.
> > - function S_reg_recode() in regcomp.c is defined.
> > - new test files, t/uni/greek.t and t/uni/latin2.t, are added.
> >
> > Though these changes concern the functionality of encoding.pm
> > (that lives a dual life), they can't be backported without changes
> > against the core. Hence all changes are placed in p5p's space.
>
> Then the patch is renewed, encoding2.gz

All tests successful (1 subtest UNEXPECTEDLY SUCCEEDED), 39 tests and 340 subtests skipped.

Thanks, applied as Change #29204

> - Test suites in the above patch emit messages including non-ASCII
>   characters, but that might be troublesome on some environments.
>
> Therefore the difference is only in messages from test suites.

--
H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/)
using & porting perl 5.6.2, 5.8.x, 5.9.x on HP-UX 10.20, 11.00, 11.11,
& 11.23, SuSE 10.0 & 10.1, AIX 4.3 & 5.2, and Cygwin.
http://qa.perl.org
http://mirrors.develooper.com/hpux/
http://www.test-smoke.org
http://www.goldmark.org/jeff/stupid-disclaimers/

