Skip Menu |
Report information
Id: 72998
Status: resolved
Priority: 0/
Queue: perl5

Owner: khw <khw [at] cpan.org>
Requestors: public [at] khwilliamson.com
Cc:
AdminCc:

Operating System: (no value)
PatchStatus: (no value)
Severity: low
Type: Unicode
Perl Version: (no value)
Fixed In: (no value)



Subject: 5.8 Regression: some regexes with multi-char folds loop
Date: Sun, 21 Feb 2010 20:07:06 -0700
To: perlbug [...] perl.org
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 3.3k
This is a bug report for perl from khw@khw-desktop.nonet, generated with the help of perlbug 1.39 running under perl 5.11.4. ----------------------------------------------------------------- "\x{100}\x{00DF}" =~ /\x{100}\N{U+73}+/i "\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i both loop. This bug is in 5.10 but not 5.8. I don't know the full circumstances, but preliminarily, it appears to affect patterns which are in utf8 with the fold a sequence of ASCII characters, and when input with \N. This causes a CURLYM regnode to be generated, instead of a PLUS node in the examples above. But it fails with * as well. ----------------------------------------------------------------- --- Flags: category=core severity=medium --- This perlbug was built using Perl 5.11.5 - Fri Feb 19 08:36:52 MST 2010 It is being executed now by Perl 5.11.4 - Sat Jan 23 13:32:28 MST 2010. Site configuration information for perl 5.11.4: Configured by khw at Sat Jan 23 13:32:28 MST 2010. Summary of my perl5 (revision 5 version 11 subversion 4) configuration: Commit id: d0a76353c03de55a7ec9f5f128afee8d3c83e23f Platform: osname=linux, osvers=2.6.27-16-generic, archname=i686-linux uname='linux khw-desktop 2.6.27-16-generic #1 smp tue dec 1 17:56:54 utc 2009 i686 gnulinux ' config_args='-s -d -Dprefix=/home/khw/blead -Dusedevel -D'optimize=-g3' -A'optimize=-g3' -A'optimize=-O0'' hint=recommended, useposix=true, d_sigaction=define useithreads=undef, usemultiplicity=undef useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=undef, use64bitall=undef, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O0 -g3', cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include' ccversion='', gccversion='4.3.2', gccosandvers='' intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12 ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=4, prototype=define Linker and Libraries: ld='cc', ldflags =' -fstack-protector -L/usr/local/lib' libpth=/usr/local/lib /lib /usr/lib libs=-lnsl -ldl -lm -lcrypt -lutil -lc perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc libc=/lib/libc-2.8.90.so, so=so, useshrplib=false, libperl=libperl.a gnulibc_version='2.8.90' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E' cccdlflags='-fPIC', lddlflags='-shared -g3 -g3 -O0 -L/usr/local/lib -fstack-protector' Locally applied patches: --- @INC for perl 5.11.4: /home/khw/blead/lib/perl5/site_perl/5.11.4/i686-linux /home/khw/blead/lib/perl5/site_perl/5.11.4 /home/khw/blead/lib/perl5/5.11.4/i686-linux /home/khw/blead/lib/perl5/5.11.4 /home/khw/blead/lib/perl5/site_perl . --- Environment for perl 5.11.4: HOME=/home/khw LANG=en_US.UTF-8 LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/opt/real/RealPlayer:/home/khw/cxoffice/bin PERL_BADLANG (unset) SHELL=/bin/ksh
Subject: Re: [perl #72998] 5.8 Regression: some regexes with multi-char folds loop
Date: Tue, 23 Feb 2010 09:14:36 -0700
To: perl5-porters [...] perl.org
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 907b
karl williamson (via RT) wrote: Show quoted text
> "\x{100}\x{00DF}" =~ /\x{100}\N{U+73}+/i > "\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i > > > both loop. This bug is in 5.10 but not 5.8. I don't know the full > circumstances, but preliminarily, it appears to affect patterns which > are in utf8 with the fold a sequence of ASCII characters, and when input > with \N. This causes a CURLYM regnode to be generated, instead of a > PLUS node in the examples above. But it fails with * as well. > -----------------------------------------------------------------
I should have mentioned that this is more likely to happen in 5.11.5 than before because of the fixes for \N{}. Now this error will happen for constructs like: "\x{FB00}" =~ /\N{U+66}+/ because the \N{U+66} now forces the pattern into utf8; or "\X{FB00}" =~ /\N{LATIN SMALL LETTER F}/ because the \N{} actually will compile correctly, unlike before.
Subject: Re: [perl #72998] 5.8 Regression: some regexes with multi-char folds loop
Date: Wed, 24 Feb 2010 13:15:04 -0500
To: perl5-porters [...] perl.org
From: Jesse Vincent <jesse [...] fsck.com>
Download (untitled) / with headers
text/plain 617b
Show quoted text
> "\x{100}\x{00DF}" =~ /\x{100}\N{U+73}+/i > "\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i > > > both loop. This bug is in 5.10 but not 5.8. I don't know the full > circumstances, but preliminarily, it appears to affect patterns which > are in utf8 with the fold a sequence of ASCII characters, and when input > with \N. This causes a CURLYM regnode to be generated, instead of a > PLUS node in the examples above. But it fails with * as well.
Even though this is likely to happen "more" in 5.11.5 than it did in 5.12, I'm not sure it's serious enough that I'd call it a blocker for the .0 release. -Jesse
CC: perl5-porters [...] perl.org
Subject: Re: [perl #72998] 5.8 Regression: some regexes with multi-char folds loop
Date: Thu, 25 Feb 2010 16:31:33 -0700
To: Jesse Vincent <jesse [...] fsck.com>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 757b
Jesse Vincent wrote: Show quoted text
>> "\x{100}\x{00DF}" =~ /\x{100}\N{U+73}+/i >> "\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i >> >> >> both loop. This bug is in 5.10 but not 5.8. I don't know the full >> circumstances, but preliminarily, it appears to affect patterns which >> are in utf8 with the fold a sequence of ASCII characters, and when input >> with \N. This causes a CURLYM regnode to be generated, instead of a >> PLUS node in the examples above. But it fails with * as well.
> > Even though this is likely to happen "more" in 5.11.5 than it did in > 5.12, I'm not sure it's serious enough that I'd call it a blocker for > the .0 release. > > -Jesse >
I'm not either. I'm thinking that it should be in the known problems of the delta. N'est pas?
CC: perl5-porters [...] perl.org
Subject: Re: [perl #72998] 5.8 Regression: some regexes with multi-char folds loop
Date: Thu, 25 Feb 2010 19:27:38 -0500
To: karl williamson <public [...] khwilliamson.com>
From: jesse <jesse [...] fsck.com>
Download (untitled) / with headers
text/plain 903b
On Thu, Feb 25, 2010 at 04:31:33PM -0700, karl williamson wrote: Show quoted text
> Jesse Vincent wrote:
> >>"\x{100}\x{00DF}" =~ /\x{100}\N{U+73}+/i > >>"\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i > >> > >> > >>both loop. This bug is in 5.10 but not 5.8. I don't know the full > >>circumstances, but preliminarily, it appears to affect patterns which > >>are in utf8 with the fold a sequence of ASCII characters, and when input > >>with \N. This causes a CURLYM regnode to be generated, instead of a > >>PLUS node in the examples above. But it fails with * as well.
> > > >Even though this is likely to happen "more" in 5.11.5 than it did in > >5.12, I'm not sure it's serious enough that I'd call it a blocker for > >the .0 release. > > > >-Jesse > >
> > I'm not either. I'm thinking that it should be in the known problems of > the delta. N'est pas?
Sure. Do you have opinions on ideal phrasing? --
CC: perl5-porters [...] perl.org
Subject: Re: [perl #72998] 5.8 Regression: some regexes with multi-char folds loop
Date: Thu, 25 Feb 2010 19:01:26 -0700
To: jesse <jesse [...] fsck.com>
From: karl williamson <public [...] khwilliamson.com>
jesse wrote: Show quoted text
> > > On Thu, Feb 25, 2010 at 04:31:33PM -0700, karl williamson wrote:
>> Jesse Vincent wrote:
>>>> "\x{100}\x{00DF}" =~ /\x{100}\N{U+73}+/i >>>> "\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i >>>> >>>> >>>> both loop. This bug is in 5.10 but not 5.8. I don't know the full >>>> circumstances, but preliminarily, it appears to affect patterns which >>>> are in utf8 with the fold a sequence of ASCII characters, and when input >>>> with \N. This causes a CURLYM regnode to be generated, instead of a >>>> PLUS node in the examples above. But it fails with * as well.
>>> Even though this is likely to happen "more" in 5.11.5 than it did in >>> 5.12, I'm not sure it's serious enough that I'd call it a blocker for >>> the .0 release. >>> >>> -Jesse >>>
>> I'm not either. I'm thinking that it should be in the known problems of >> the delta. N'est pas?
> > Sure. Do you have opinions on ideal phrasing? >
If I could phrase things ideally, I might have been a poet and not a programmer. (But then again that doesn't pay as well :)). I'll submit a patch as a starting point.
Subject: Re: [perl #72998] 5.8 Regression: some regexes with multi-char folds loop
Date: Tue, 02 Mar 2010 14:48:55 -0700
To: perl5-porters [...] perl.org, Yves Orton <yves.orton [...] booking.com>, demerphq <demerphq [...] gmail.com>, jesse <jesse [...] fsck.com>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 1.7k
karl williamson (via RT) wrote: Show quoted text
> > This is a bug report for perl from khw@khw-desktop.nonet, > generated with the help of perlbug 1.39 running under perl 5.11.4. > > > ----------------------------------------------------------------- > "\x{100}\x{00DF}" =~ /\x{100}\N{U+73}+/i > "\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i > > > both loop. This bug is in 5.10 but not 5.8. I don't know the full > circumstances, but preliminarily, it appears to affect patterns which > are in utf8 with the fold a sequence of ASCII characters, and when input > with \N. This causes a CURLYM regnode to be generated, instead of a > PLUS node in the examples above. But it fails with * as well. > -----------------------------------------------------------------
I looked at this enough to see part of what is causing it, and that it apparently isn't easy to fix; perhaps it would be to someone more familiar with regexec.c. The problem I suspect stems from the attempt to shoehorn in multi-char folds into code that never dreamed of the possibility. The case fold of FB00 is 'ff'. So /f+/i should match it (\N{U+66} is another way of writing 'f'). The problem is that each f matches only half of FB00, but the code doesn't know how to deal with half a match, so it matches the half, and then leaves the pointer to FB00 unchanged. In a way this is not a regression from 5.8, because it didn't work properly in 5.8 either; it just failed quickly without looping. If you replace the \N{U+66} with the equivalent \x{66} or just f, it also doesn't match properly, but fails quickly. The difference is that the \N{} construct forces it into using CURLYM instead of another type of CURLY, and CURLYM loops, and the others fail quickly. Hopefully, Yves' redesign of multi-char folding will fix this whole thing.
Subject: PATCH: [perl #72998] regex looping; regression from 5.8
Date: Tue, 13 Apr 2010 21:37:11 -0600
To: Perl5 Porters <perl5-porters [...] perl.org>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 121b
In 5.8 it doesn't match correctly either, but it doesn't loop forever. This restores that better but not idyllic state.
From fbf9950bdadcfe4479e137c54e2546b23a221061 Mon Sep 17 00:00:00 2001 From: Karl Williamson <khw@khw-desktop.(none)> Date: Tue, 13 Apr 2010 21:25:36 -0600 Subject: [PATCH] PATCH: [perl #72998] regex looping If a character folds to multiple ones in case-insensitive matching, it should not match just one of those, or the regular expression can loop. This patch doesn't fix the problem of it not matching correctly, but it simply causes it to fail now, rather than much much later, by disallowing the case where it thought it matched but did not advance the pointer to the next character. Fixing the problem in general of multi-char folding characters is a much bigger issue, to be addressed later. --- t/re/re.t | 10 +++++++++- utf8.c | 3 ++- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/t/re/re.t b/t/re/re.t index 87965f2..aae8120 100644 --- a/t/re/re.t +++ b/t/re/re.t @@ -51,6 +51,14 @@ if ('1234'=~/(?:(?<A>\d)|(?<C>!))(?<B>\d)(?<A>\d)(?<B>\d)/){ } is(regnames_count(),3); } + + { + # Bug #72998; this can loop + watchdog(2); + eval '"\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i'; + pass("Didn't loop"); + } + # New tests above this line, don't forget to update the test count below! -BEGIN { plan tests => 18 } +BEGIN { plan tests => 19 } # No tests here! diff --git a/utf8.c b/utf8.c index 9ed0663..1a6077c 100644 --- a/utf8.c +++ b/utf8.c @@ -2609,7 +2609,8 @@ Perl_ibcmp_utf8(pTHX_ const char *s1, char **pe1, register UV l1, bool u1, const /* A match is defined by all the scans that specified * an explicit length reaching their final goals. */ - match = (f1 == 0 || p1 == f1) && (f2 == 0 || p2 == f2); + match = (n1 == 0 && n2 == 0 /* Must not match partial char; Bug #72998 */ + && (f1 == 0 || p1 == f1) && (f2 == 0 || p2 == f2)); if (match) { if (pe1) -- 1.5.6.3
Subject: Revised PATCH: [perl #72998] regex looping; regression from 5.8
Date: Wed, 14 Apr 2010 19:35:36 -0600
To: Perl5 Porters <perl5-porters [...] perl.org>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 560b
This is a small revision to the patch I submitted yesterday. The differences are that a comment has been added to the test case telling that it should be kept last in the file, and the commit message is much more detailed about the problem this patches. I actually have a more extensive patch prepared that cleans up the whole routine to be faster and clearer, but I think this should go in 5.12.1, as it solves not a Perl crash, but a hung Perl, which is perhaps worse, and I thought it prudent to submit a minimal patch for the maintenance release.
From 93eb9a395ae991368a9b7fd3691c13791b7d5cf5 Mon Sep 17 00:00:00 2001 From: Karl Williamson <khw@khw-desktop.(none)> Date: Tue, 13 Apr 2010 21:25:36 -0600 Subject: [PATCH] PATCH: [perl #72998] regex looping If a character folds to multiple ones in case-insensitive matching, it should not match just one of those, or the regular expression can loop. For example, \N{LATIN SMALL LIGATURE FF} folds to 'ff', and so "\N{LATIN SMALL LIGATURE FF}" =~ /f+/i should match. Prior to this patch, this function returned that there is a match, but left the matching string pointer at the beginning of the "\N{LATIN SMALL LIGATURE FF}" because it doesn't make sense to match just half a character, and at this level it doesn't know about the '+'. This leaves things in an inconsistent state, with the reporting of a match, but the input pointer unchanged, the result of which is a loop. I don't know how to fix this so that it correctly matches, and there are semantic issues with doing so. For example, if "\N{LATIN SMALL LIGATURE FF}" =~ /ff/i matches, then one would think that so should "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i But $1 and $2 don't really make sense here, since they both refer to the half of the same character. So this patch just returns failure if only a partial character is matched. That leaves things consistent, and solves the problem of looping, so that Perl doesn't hang on such a construct, but leaves the ultimate solution for another day. --- t/re/re.t | 10 +++++++++- utf8.c | 3 ++- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/t/re/re.t b/t/re/re.t index 87965f2..249c6dd 100644 --- a/t/re/re.t +++ b/t/re/re.t @@ -51,6 +51,14 @@ if ('1234'=~/(?:(?<A>\d)|(?<C>!))(?<B>\d)(?<A>\d)(?<B>\d)/){ } is(regnames_count(),3); } + + { # Keep this test last, as whole script will be interrupted if times out + # Bug #72998; this can loop + watchdog(2); + eval '"\x{100}\x{FB00}" =~ /\x{100}\N{U+66}+/i'; + pass("Didn't loop"); + } + # New tests above this line, don't forget to update the test count below! -BEGIN { plan tests => 18 } +BEGIN { plan tests => 19 } # No tests here! diff --git a/utf8.c b/utf8.c index 9ed0663..1a6077c 100644 --- a/utf8.c +++ b/utf8.c @@ -2609,7 +2609,8 @@ Perl_ibcmp_utf8(pTHX_ const char *s1, char **pe1, register UV l1, bool u1, const /* A match is defined by all the scans that specified * an explicit length reaching their final goals. */ - match = (f1 == 0 || p1 == f1) && (f2 == 0 || p2 == f2); + match = (n1 == 0 && n2 == 0 /* Must not match partial char; Bug #72998 */ + && (f1 == 0 || p1 == f1) && (f2 == 0 || p2 == f2)); if (match) { if (pe1) -- 1.5.6.3
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: Revised PATCH: [perl #72998] regex looping; regression from 5.8
Date: Thu, 15 Apr 2010 10:32:27 +0200
To: karl williamson <public [...] khwilliamson.com>
From: Rafael Garcia-Suarez <rgs [...] consttype.org>
Download (untitled) / with headers
text/plain 685b
On 15 April 2010 03:35, karl williamson <public@khwilliamson.com> wrote: Show quoted text
> This is a small revision to the patch I submitted yesterday.  The > differences are that a comment has been added to the test case telling that > it should be kept last in the file, and the commit message is much more > detailed about the problem this patches. > > I actually have a more extensive patch prepared that cleans up the whole > routine to be faster and clearer, but I think this should go in 5.12.1, as > it solves not a Perl crash, but a hung Perl, which is perhaps worse, and I > thought it prudent to submit a minimal patch for the maintenance release.
I agree. Thanks, applied to bleadperl.


This service is sponsored and maintained by Best Practical Solutions and runs on Perl.org infrastructure.

For issues related to this RT instance (aka "perlbug"), please contact perlbug-admin at perl.org