Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bleadperl v5.13.10-141-g2279eab breaks RJBS/Pod-Coverage-TrustPod-0.092832.tar.gz #11237

Closed
p5pRT opened this issue Apr 5, 2011 · 10 comments
Closed

Comments

@p5pRT
Copy link

p5pRT commented Apr 5, 2011

Migrated from rt.perl.org#87812 (status was 'resolved')

Searchable as RT87812$

@p5pRT
Copy link
Author

p5pRT commented Apr 5, 2011

From @andk

rt.cpan ticket​:


  https://rt.cpan.org/Ticket/Display.html?id=67040

git bisect​:


  commit 2279eab
  Author​: Karl Williamson <public@​khwilliamson.com>
  Date​: Mon Feb 28 08​:38​:30 2011 -0700

  regcomp.c​: Remove temporary code

perl -V​:


  Summary of my perl5 (revision 5 version 13 subversion 10) configuration​:
  Commit id​: 2279eab
  Platform​:
  osname=linux, osvers=2.6.32-5-amd64, archname=x86_64-linux
  uname='linux k81 2.6.32-5-amd64 #1 smp wed jan 12 03​:40​:32 utc 2011 x86_64 gnulinux '
  config_args='-Dprefix=/home/src/perl/repoperls/installed-perls/perl/v5.13.10-141-g2279eab -Dinstallusrbinperl=n -Uversiononly -Dusedevel -des -Ui_db -Uuseithreads -Uuselongdouble -DDEBUGGING=-g'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=undef, usemultiplicity=undef
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=define, use64bitall=define, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O2 -g',
  cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='4.5.2', gccosandvers=''
  intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
  ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
  libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
  perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
  libc=/lib/libc-2.11.2.so, so=so, useshrplib=false, libperl=libperl.a
  gnulibc_version='2.11.2'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib -fstack-protector'

  Characteristics of this binary (from libperl)​:
  Compile-time options​: PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP PERL_USE_DEVEL
  USE_64_BIT_ALL USE_64_BIT_INT USE_LARGE_FILES
  USE_PERLIO USE_PERL_ATOF
  Built under linux
  Compiled at Apr 4 2011 00​:52​:53
  @​INC​:
  /home/src/perl/repoperls/installed-perls/perl/v5.13.10-141-g2279eab/lib/site_perl/5.13.10/x86_64-linux
  /home/src/perl/repoperls/installed-perls/perl/v5.13.10-141-g2279eab/lib/site_perl/5.13.10
  /home/src/perl/repoperls/installed-perls/perl/v5.13.10-141-g2279eab/lib/5.13.10/x86_64-linux
  /home/src/perl/repoperls/installed-perls/perl/v5.13.10-141-g2279eab/lib/5.13.10
  .

--
andreas

@p5pRT
Copy link
Author

p5pRT commented Apr 8, 2011

From @rjbs

I have fixed Pod​::Coverage​::TrustPod and khw is investigating the
underlying changes.

@p5pRT
Copy link
Author

p5pRT commented Apr 8, 2011

@rjbs - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Apr 8, 2011

From @khwilliamson

On 04/05/2011 12​:43 AM, (Andreas J. Koenig) (via RT) wrote​:

# New Ticket Created by (Andreas J. Koenig)
# Please include the string​: [perl #87812]
# in the subject line of all future correspondence about this issue.
#<URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=87812>

rt.cpan ticket​:
---------------

https://rt.cpan.org/Ticket/Display.html?id=67040

git bisect​:
-----------

commit 2279eab
Author​: Karl Williamson<public@​khwilliamson.com>
Date​: Mon Feb 28 08​:38​:30 2011 -0700

regcomp.c​: Remove temporary code

I have investigated this and found the cause, but don't know the best
solution for 5.14.

The problem is that the failing .t is changing @​INC to exclude 'lib', so
that utf8.pm is not found, and it turns out that changes in 5.14 are
causing utf8.pm to get loaded where previously it wasn't, and, actually,
loading it actually isn't needed.

Those 5.14 changes fix Unicode matching. The problem is that /[a-z]/i
can match Unicode characters, and the code is not smart enough to
distinguish between matching specific characters and matching a swash
that has to be read in, so it tries to read it in, unsuccessfully.

For this module, rjbs is thinking that the @​INC restriction isn't
necessary, and so we can get this immediate failure to pass.

And, my understanding is that a normal program shouldn't expect to work
if its @​INC doesn't include lib, so this isn't a correctness issue IMO.
  But it may be considered to be a performance issue--there is a
slowdown in matching apparently innocuous regexes. (Earlier I caused
this slowdown during compilation of them, but fixed that, so the
slowdown now occurs only during execution, and may not happen, depending
on the string being matched.)

My plan all along has been to change things in 5.16, with the side
effect that this would not be an issue there.

Looking at the code, it's doing something to avoid what I consider a
design flaw in the optimizer (but perhaps there is an undocumented
reason for that design that I'm unaware of). There is another way to do
it that is somewhat less clean, but should avoid this problem.

There are two other ways to do things. One is to scan the list of
matching things at execution time to distinguish between things that
don't require a swash, from those that do. Another is to make that
distinction at compile time by keeping two lists instead of a combined one.

I can look into all this during the weekend. My sense is that this
performance issue should be fixed in 5.14.

@p5pRT
Copy link
Author

p5pRT commented Apr 9, 2011

From @khwilliamson

The attached patches should fix this problem. I have pushed them to
blead, but am attaching them here for people to review as well.
Again, this is a kludge around a design flaw in the optimizer, which
should go away in 5.16 as a result of other planned changes.

The reason that this issue previously passed all local tests is that the
failing .t excludes lib from @​INC, and otherwise this is a performance
issue as opposed to a functionality bug. I would like some advice as to
which core .t to put a test into. There are a number that restrict
@​INC, but nothing jumps out at me as an appropriate place. Perhaps a
new .t should be created, but I'm not sure it's even possible, given
that various of the included test files include lib.

@p5pRT
Copy link
Author

p5pRT commented Apr 9, 2011

From @khwilliamson

0004-regcomp.c-Shun-ANYOF_NONBITMAP_NON_UTF8.patch
From 2e1004ab949f37392a5a76e38c80da86b88ac538 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 9 Apr 2011 10:42:15 -0600
Subject: [PATCH 4/5] regcomp.c: Shun ANYOF_NONBITMAP_NON_UTF8

As noted in the comments in the code for this commit, this flag has
higher consequences than others when it is inappropriately set in the
synthetic start class.  This patch causes it to not be set unless there
is some path in the regex engine that needs it, but once set it is never
cleared.  This results in a different set of false positives than
currently, but the current set can have this set even if there is no
path in the engine that needs it.
---
 regcomp.c |   55 +++++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 39 insertions(+), 16 deletions(-)

diff --git a/regcomp.c b/regcomp.c
index d639c58..e626489 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -730,15 +730,7 @@ S_cl_anything(const RExC_state_t *pRExC_state, struct regnode_charclass_class *c
 
     ANYOF_BITMAP_SETALL(cl);
     cl->flags = ANYOF_CLASS|ANYOF_EOS|ANYOF_UNICODE_ALL
-		|ANYOF_LOC_NONBITMAP_FOLD|ANYOF_NON_UTF8_LATIN1_ALL
-		    /* Even though no bitmap is in use here, we need to set
-		     * the flag below so an AND with a node that does have one
-		     * doesn't lose that one.  The flag should get cleared if
-		     * the other one doesn't; and the code in regexec.c is
-		     * structured so this being set when not needed does no
-		     * harm.  It seemed a little cleaner to set it here than do
-		     * a special case in cl_and() */
-		|ANYOF_NONBITMAP_NON_UTF8;
+		|ANYOF_LOC_NONBITMAP_FOLD|ANYOF_NON_UTF8_LATIN1_ALL;
 
     /* If any portion of the regex is to operate under locale rules,
      * initialization includes it.  The reason this isn't done for all regexes
@@ -841,6 +833,8 @@ S_cl_and(struct regnode_charclass_class *cl,
 	}
     }
     else {   /* and'd node is not inverted */
+	U8 outside_bitmap_but_not_utf8; /* Temp variable */
+
 	if (! ANYOF_NONBITMAP(and_with)) {
 
             /* Here 'and_with' doesn't match anything outside the bitmap
@@ -859,14 +853,18 @@ S_cl_and(struct regnode_charclass_class *cl,
 	    /* Here, 'and_with' does match something outside the bitmap, and cl
 	     * doesn't have a list of things to match outside the bitmap.  If
              * cl can match all code points above 255, the intersection will
-             * be those above-255 code points that 'and_with' matches.  There
-             * may be false positives from code points in 'and_with' that are
-             * outside the bitmap but below 256, but those get sorted out
-             * after the synthetic start class succeeds).  If cl can't match
-             * all Unicode code points, it means here that it can't match *
-             * anything outside the bitmap, so we leave the bitmap empty */
+             * be those above-255 code points that 'and_with' matches.  If cl
+             * can't match all Unicode code points, it means that it can't
+             * match anything outside the bitmap (since the 'if' that got us
+             * into this block tested for that), so we leave the bitmap empty.
+             */
 	    if (cl->flags & ANYOF_UNICODE_ALL) {
 		ARG_SET(cl, ARG(and_with));
+
+                /* and_with's ARG may match things that don't require UTF8.
+                 * And now cl's will too, in spite of this being an 'and'.  See
+                 * the comments below about the kludge */
+		cl->flags |= and_with->flags & ANYOF_NONBITMAP_NON_UTF8;
 	    }
 	}
 	else {
@@ -876,8 +874,33 @@ S_cl_and(struct regnode_charclass_class *cl,
 	}
 
 
-        /* Take the intersection of the two sets of flags */
+        /* Take the intersection of the two sets of flags.  However, the
+         * ANYOF_NONBITMAP_NON_UTF8 flag is treated as an 'or'.  This is a
+         * kludge around the fact that this flag is not treated like the others
+         * which are initialized in cl_anything().  The way the optimizer works
+         * is that the synthetic start class (SSC) is initialized to match
+         * anything, and then the first time a real node is encountered, its
+         * values are AND'd with the SSC's with the result being the values of
+         * the real node.  However, there are paths through the optimizer where
+         * the AND never gets called, so those initialized bits are set
+         * inappropriately, which is not usually a big deal, as they just cause
+         * false positives in the SSC, which will just mean a probably
+         * imperceptible slow down in execution.  However this bit has a
+         * higher false positive consequence in that it can cause utf8.pm,
+         * utf8_heavy.pl ... to be loaded when not necessary, which is a much
+         * bigger slowdown and also causes significant extra memory to be used.
+         * In order to prevent this, the code now takes a different tack.  The
+         * bit isn't set unless some part of the regular expression needs it,
+         * but once set it won't get cleared.  This means that these extra
+         * modules won't get loaded unless there was some path through the
+         * pattern that would have required them anyway, and  so any false
+         * positives that occur by not ANDing them out when they could be
+         * aren't as severe as they would be if we treated this bit like all
+         * the others */
+        outside_bitmap_but_not_utf8 = (cl->flags | and_with->flags)
+                                      & ANYOF_NONBITMAP_NON_UTF8;
 	cl->flags &= and_with->flags;
+	cl->flags |= outside_bitmap_but_not_utf8;
     }
 }
 
-- 
1.7.1

@p5pRT
Copy link
Author

p5pRT commented Apr 9, 2011

From @khwilliamson

0005-PATCH-perl-87812-BBC-breaks-Pod-Coverage-TrustPod.patch
From 80265e1f175494ba2c5b34844125685f0d67fd38 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 9 Apr 2011 11:03:37 -0600
Subject: [PATCH 5/5] PATCH: [perl #87812] BBC breaks Pod::Coverage::TrustPod

This patch completes the fixing of this problem.  The problem is that
the failing .t set @INC to exclude lib, and hence couldn't find utf8.pm,
which 5.14 was requiring in places where it previously didn't.  This
patch finishes the job of not requiring utf8.pm in so many places as
were inadvertently added in 5.14.  Commit
3ad98780b4bded02c371c83a668dc8f323e57718 started the job.

This patch changes regcomp.c to not set ANYOF_NONBITMAP_NON_UTF8 where
it inappropriately was.  I don't know what I was thinking when I
originally did what this changes.  In order to match outside the bitmap,
these characters all must match something that requires utf8, such as a
LIGATURE FI.
---
 regcomp.c |   14 +++++---------
 1 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/regcomp.c b/regcomp.c
index e626489..ade999c 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -9426,21 +9426,17 @@ S_set_regclass_bit_fold(pTHX_ RExC_state_t *pRExC_state, regnode* node, const U8
 	    case 'I': case 'i':
 	    case 'L': case 'l':
 	    case 'T': case 't':
-		/* These all are targets of multi-character folds, which can
-		 * occur with only non-Latin1 characters in the fold, so they
-		 * can match if the target string isn't UTF-8 */
-		ANYOF_FLAGS(node) |= ANYOF_NONBITMAP_NON_UTF8;
-		break;
 	    case 'A': case 'a':
 	    case 'H': case 'h':
 	    case 'J': case 'j':
 	    case 'N': case 'n':
 	    case 'W': case 'w':
 	    case 'Y': case 'y':
-		/* These all are targets of multi-character folds, which occur
-		 * only with a non-Latin1 character as part of the fold, so
-		 * they can't match unless the target string is in UTF-8, so no
-		 * action here is necessary */
+                /* These all are targets of multi-character folds from code
+                 * points that require UTF8 to express, so they can't match
+                 * unless the target string is in UTF-8, so no action here is
+                 * necessary, as regexec.c properly handles the general case
+                 * for UTF-8 matching */
 		break;
 	    default:
 		/* Use deprecated warning to increase the chances of this
-- 
1.7.1

@p5pRT
Copy link
Author

p5pRT commented Apr 11, 2011

From @cpansprout

On Sat Apr 09 10​:36​:42 2011, public@​khwilliamson.com wrote​:

The attached patches should fix this problem. I have pushed them to
blead, but am attaching them here for people to review as well.
Again, this is a kludge around a design flaw in the optimizer, which
should go away in 5.16 as a result of other planned changes.

The reason that this issue previously passed all local tests is that the
failing .t excludes lib from @​INC, and otherwise this is a performance
issue as opposed to a functionality bug. I would like some advice as to
which core .t to put a test into. There are a number that restrict
@​INC, but nothing jumps out at me as an appropriate place. Perhaps a
new .t should be created, but I'm not sure it's even possible, given
that various of the included test files include lib.

There may be a better way, but the *easiest* way is to write a test
script that uses no test libraries​:

#!perl

print "1..1\n";

... do stuff ...

print "not" if exists $INC{"utf8.pm"};
print "ok 1\n";

@p5pRT
Copy link
Author

p5pRT commented Apr 11, 2011

From @khwilliamson

This has been fixed by commits
f580a93
3ad9878
eac006c

--Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Apr 11, 2011

@khwilliamson - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant