New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bleadperl v5.13.10-141-g2279eab breaks RJBS/Pod-Coverage-TrustPod-0.092832.tar.gz #11237
Comments
From @andkrt.cpan ticket: https://rt.cpan.org/Ticket/Display.html?id=67040 git bisect: commit 2279eab regcomp.c: Remove temporary code perl -V: Summary of my perl5 (revision 5 version 13 subversion 10) configuration: Characteristics of this binary (from libperl): -- |
From @rjbsI have fixed Pod::Coverage::TrustPod and khw is investigating the |
@rjbs - Status changed from 'new' to 'open' |
From @khwilliamsonOn 04/05/2011 12:43 AM, (Andreas J. Koenig) (via RT) wrote:
I have investigated this and found the cause, but don't know the best The problem is that the failing .t is changing @INC to exclude 'lib', so Those 5.14 changes fix Unicode matching. The problem is that /[a-z]/i For this module, rjbs is thinking that the @INC restriction isn't And, my understanding is that a normal program shouldn't expect to work My plan all along has been to change things in 5.16, with the side Looking at the code, it's doing something to avoid what I consider a There are two other ways to do things. One is to scan the list of I can look into all this during the weekend. My sense is that this |
From @khwilliamsonThe attached patches should fix this problem. I have pushed them to The reason that this issue previously passed all local tests is that the |
From @khwilliamson0004-regcomp.c-Shun-ANYOF_NONBITMAP_NON_UTF8.patchFrom 2e1004ab949f37392a5a76e38c80da86b88ac538 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 9 Apr 2011 10:42:15 -0600
Subject: [PATCH 4/5] regcomp.c: Shun ANYOF_NONBITMAP_NON_UTF8
As noted in the comments in the code for this commit, this flag has
higher consequences than others when it is inappropriately set in the
synthetic start class. This patch causes it to not be set unless there
is some path in the regex engine that needs it, but once set it is never
cleared. This results in a different set of false positives than
currently, but the current set can have this set even if there is no
path in the engine that needs it.
---
regcomp.c | 55 +++++++++++++++++++++++++++++++++++++++----------------
1 files changed, 39 insertions(+), 16 deletions(-)
diff --git a/regcomp.c b/regcomp.c
index d639c58..e626489 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -730,15 +730,7 @@ S_cl_anything(const RExC_state_t *pRExC_state, struct regnode_charclass_class *c
ANYOF_BITMAP_SETALL(cl);
cl->flags = ANYOF_CLASS|ANYOF_EOS|ANYOF_UNICODE_ALL
- |ANYOF_LOC_NONBITMAP_FOLD|ANYOF_NON_UTF8_LATIN1_ALL
- /* Even though no bitmap is in use here, we need to set
- * the flag below so an AND with a node that does have one
- * doesn't lose that one. The flag should get cleared if
- * the other one doesn't; and the code in regexec.c is
- * structured so this being set when not needed does no
- * harm. It seemed a little cleaner to set it here than do
- * a special case in cl_and() */
- |ANYOF_NONBITMAP_NON_UTF8;
+ |ANYOF_LOC_NONBITMAP_FOLD|ANYOF_NON_UTF8_LATIN1_ALL;
/* If any portion of the regex is to operate under locale rules,
* initialization includes it. The reason this isn't done for all regexes
@@ -841,6 +833,8 @@ S_cl_and(struct regnode_charclass_class *cl,
}
}
else { /* and'd node is not inverted */
+ U8 outside_bitmap_but_not_utf8; /* Temp variable */
+
if (! ANYOF_NONBITMAP(and_with)) {
/* Here 'and_with' doesn't match anything outside the bitmap
@@ -859,14 +853,18 @@ S_cl_and(struct regnode_charclass_class *cl,
/* Here, 'and_with' does match something outside the bitmap, and cl
* doesn't have a list of things to match outside the bitmap. If
* cl can match all code points above 255, the intersection will
- * be those above-255 code points that 'and_with' matches. There
- * may be false positives from code points in 'and_with' that are
- * outside the bitmap but below 256, but those get sorted out
- * after the synthetic start class succeeds). If cl can't match
- * all Unicode code points, it means here that it can't match *
- * anything outside the bitmap, so we leave the bitmap empty */
+ * be those above-255 code points that 'and_with' matches. If cl
+ * can't match all Unicode code points, it means that it can't
+ * match anything outside the bitmap (since the 'if' that got us
+ * into this block tested for that), so we leave the bitmap empty.
+ */
if (cl->flags & ANYOF_UNICODE_ALL) {
ARG_SET(cl, ARG(and_with));
+
+ /* and_with's ARG may match things that don't require UTF8.
+ * And now cl's will too, in spite of this being an 'and'. See
+ * the comments below about the kludge */
+ cl->flags |= and_with->flags & ANYOF_NONBITMAP_NON_UTF8;
}
}
else {
@@ -876,8 +874,33 @@ S_cl_and(struct regnode_charclass_class *cl,
}
- /* Take the intersection of the two sets of flags */
+ /* Take the intersection of the two sets of flags. However, the
+ * ANYOF_NONBITMAP_NON_UTF8 flag is treated as an 'or'. This is a
+ * kludge around the fact that this flag is not treated like the others
+ * which are initialized in cl_anything(). The way the optimizer works
+ * is that the synthetic start class (SSC) is initialized to match
+ * anything, and then the first time a real node is encountered, its
+ * values are AND'd with the SSC's with the result being the values of
+ * the real node. However, there are paths through the optimizer where
+ * the AND never gets called, so those initialized bits are set
+ * inappropriately, which is not usually a big deal, as they just cause
+ * false positives in the SSC, which will just mean a probably
+ * imperceptible slow down in execution. However this bit has a
+ * higher false positive consequence in that it can cause utf8.pm,
+ * utf8_heavy.pl ... to be loaded when not necessary, which is a much
+ * bigger slowdown and also causes significant extra memory to be used.
+ * In order to prevent this, the code now takes a different tack. The
+ * bit isn't set unless some part of the regular expression needs it,
+ * but once set it won't get cleared. This means that these extra
+ * modules won't get loaded unless there was some path through the
+ * pattern that would have required them anyway, and so any false
+ * positives that occur by not ANDing them out when they could be
+ * aren't as severe as they would be if we treated this bit like all
+ * the others */
+ outside_bitmap_but_not_utf8 = (cl->flags | and_with->flags)
+ & ANYOF_NONBITMAP_NON_UTF8;
cl->flags &= and_with->flags;
+ cl->flags |= outside_bitmap_but_not_utf8;
}
}
--
1.7.1
|
From @khwilliamson0005-PATCH-perl-87812-BBC-breaks-Pod-Coverage-TrustPod.patchFrom 80265e1f175494ba2c5b34844125685f0d67fd38 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Sat, 9 Apr 2011 11:03:37 -0600
Subject: [PATCH 5/5] PATCH: [perl #87812] BBC breaks Pod::Coverage::TrustPod
This patch completes the fixing of this problem. The problem is that
the failing .t set @INC to exclude lib, and hence couldn't find utf8.pm,
which 5.14 was requiring in places where it previously didn't. This
patch finishes the job of not requiring utf8.pm in so many places as
were inadvertently added in 5.14. Commit
3ad98780b4bded02c371c83a668dc8f323e57718 started the job.
This patch changes regcomp.c to not set ANYOF_NONBITMAP_NON_UTF8 where
it inappropriately was. I don't know what I was thinking when I
originally did what this changes. In order to match outside the bitmap,
these characters all must match something that requires utf8, such as a
LIGATURE FI.
---
regcomp.c | 14 +++++---------
1 files changed, 5 insertions(+), 9 deletions(-)
diff --git a/regcomp.c b/regcomp.c
index e626489..ade999c 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -9426,21 +9426,17 @@ S_set_regclass_bit_fold(pTHX_ RExC_state_t *pRExC_state, regnode* node, const U8
case 'I': case 'i':
case 'L': case 'l':
case 'T': case 't':
- /* These all are targets of multi-character folds, which can
- * occur with only non-Latin1 characters in the fold, so they
- * can match if the target string isn't UTF-8 */
- ANYOF_FLAGS(node) |= ANYOF_NONBITMAP_NON_UTF8;
- break;
case 'A': case 'a':
case 'H': case 'h':
case 'J': case 'j':
case 'N': case 'n':
case 'W': case 'w':
case 'Y': case 'y':
- /* These all are targets of multi-character folds, which occur
- * only with a non-Latin1 character as part of the fold, so
- * they can't match unless the target string is in UTF-8, so no
- * action here is necessary */
+ /* These all are targets of multi-character folds from code
+ * points that require UTF8 to express, so they can't match
+ * unless the target string is in UTF-8, so no action here is
+ * necessary, as regexec.c properly handles the general case
+ * for UTF-8 matching */
break;
default:
/* Use deprecated warning to increase the chances of this
--
1.7.1
|
From @cpansproutOn Sat Apr 09 10:36:42 2011, public@khwilliamson.com wrote:
There may be a better way, but the *easiest* way is to write a test #!perl print "1..1\n"; ... do stuff ... print "not" if exists $INC{"utf8.pm"}; |
From @khwilliamsonThis has been fixed by commits --Karl Williamson |
@khwilliamson - Status changed from 'open' to 'resolved' |
Migrated from rt.perl.org#87812 (status was 'resolved')
Searchable as RT87812$
The text was updated successfully, but these errors were encountered: