Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PATCH] Coverity: regexec.c etc: off-by-one error #13757

Closed
p5pRT opened this issue Apr 23, 2014 · 7 comments
Closed

[PATCH] Coverity: regexec.c etc: off-by-one error #13757

p5pRT opened this issue Apr 23, 2014 · 7 comments
Labels

Comments

@p5pRT
Copy link

p5pRT commented Apr 23, 2014

Migrated from rt.perl.org#121708 (status was 'resolved')

Searchable as RT121708$

@p5pRT
Copy link
Author

p5pRT commented Apr 23, 2014

From @jhi

- the "c1 > 256" was off-by-one, it needed to be "c1 > 255",
  it could have caused the PL_fold_locale to be accessed one past
  the end, at offset 256, but we have dodged the bullet thanks to
  the regex engine optimizing the bad case away before we hit it
  (analysis by Karl Williamson)​: regexec.c
- comment fixes (pointed out by Karl Williamson)​: regexec.c
- add tests to nail down the behaviour of fold matching
  for the last of Latin-1 (0xFF, lowercase which curiously does not have
  uppercase within Latin-1). and the first pure Unicode​: t/re/pat.t

Attached.

@p5pRT
Copy link
Author

p5pRT commented Apr 23, 2014

From @jhi

0005-Fix-for-Coverity-perl5-CID-29033.patch
From 2bfa09bfd0d93131aa0da7926a49d16ca54970aa Mon Sep 17 00:00:00 2001
From: Jarkko Hietaniemi <jhi@iki.fi>
Date: Mon, 21 Apr 2014 17:07:20 -0400
Subject: [PATCH 5/9] Fix for Coverity perl5 CID 29033: Out-of-bounds read
 (OVERRUN) overrun-local: Overrunning array PL_fold_locale of 256 bytes at
 byte offset 256 using index c1 (which evaluates to 256).

- the "c1 > 256" was off-by-one, it needed to be "c1 > 255",
  it could have caused the PL_fold_locale to be accessed one past the end,
  at offset 256, but we have dodged the bullet thanks to the regex engine
  optimizing the bad case away before we hit it (analysis by Karl Williamson):
  regexec.c
- comment fixes (pointed out by Karl Williamson): regexec.c
- add tests to nail down the behaviour of fold matching
  for the last of Latin-1 (0xFF, lowercase which curiously does not have
  uppercase within Latin-1). and the first pure Unicode: t/re/pat.t
---
 regexec.c  |  8 ++++----
 t/re/pat.t | 26 +++++++++++++++++++++++++-
 2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/regexec.c b/regexec.c
index 4ed2ba9..6b31aad 100644
--- a/regexec.c
+++ b/regexec.c
@@ -3693,7 +3693,7 @@ S_setup_EXACTISH_ST_c1_c2(pTHX_ const regnode * const text_node, int *c1p,
         }
         else { /* an EXACTFish node which doesn't begin with a multi-char fold */
             c1 = is_utf8_pat ? valid_utf8_to_uvchr(pat, NULL) : *pat;
-            if (c1 > 256) {
+            if (c1 > 255) {
                 /* Load the folds hash, if not already done */
                 SV** listp;
                 if (! PL_utf8_foldclosures) {
@@ -3746,10 +3746,10 @@ S_setup_EXACTISH_ST_c1_c2(pTHX_ const regnode * const text_node, int *c1p,
                         /* Folds that cross the 255/256 boundary are forbidden
                          * if EXACTFL (and isnt a UTF8 locale), or EXACTFA and
                          * one is ASCIII.  Since the pattern character is above
-                         * 256, and its only other match is below 256, the only
+                         * 255, and its only other match is below 256, the only
                          * legal match will be to itself.  We have thrown away
                          * the original, so have to compute which is the one
-                         * above 255 */
+                         * above 255. */
                         if ((c1 < 256) != (c2 < 256)) {
                             if ((OP(text_node) == EXACTFL
                                  && ! IN_UTF8_CTYPE_LOCALE)
@@ -3768,7 +3768,7 @@ S_setup_EXACTISH_ST_c1_c2(pTHX_ const regnode * const text_node, int *c1p,
                     }
                 }
             }
-            else /* Here, c1 is < 255 */
+            else /* Here, c1 is <= 255 */
                 if (utf8_target
                     && HAS_NONLATIN1_FOLD_CLOSURE(c1)
                     && ( ! (OP(text_node) == EXACTFL && ! IN_UTF8_CTYPE_LOCALE))
diff --git a/t/re/pat.t b/t/re/pat.t
index 04f8b84..81cb64b 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -20,7 +20,7 @@ BEGIN {
     require './test.pl';
 }
 
-plan tests => 721;  # Update this when adding/deleting tests.
+plan tests => 733;  # Update this when adding/deleting tests.
 
 run_tests() unless caller;
 
@@ -1580,7 +1580,31 @@ EOP
         like "\x{AA}", qr/a?[\W_]/d, "\\W with /d synthetic start class works";
     }
 
+    {
+        # Verify that the very last Latin-1 U+00FF
+        # (LATIN SMALL LETTER Y WITH DIAERESIS)
+        # and its UPPER counterpart (U+0178 which is pure Unicode),
+        # and likewise for the very first pure Unicode
+        # (LATIN CAPITAL LETTER A WITH MACRON) fold-match properly,
+        # and there are no off-by-one logic errors in the transition zone.
+
+        ok("\xFF" =~ /\xFF/i, "Y WITH DIAERESIS l =~ l");
+        ok("\xFF" =~ /\x{178}/i, "Y WITH DIAERESIS l =~ u");
+        ok("\x{178}" =~ /\xFF/i, "Y WITH DIAERESIS u =~ l");
+        ok("\x{178}" =~ /\x{178}/i, "Y WITH DIAERESIS u =~ u");
 
+        # U+00FF with U+05D0 (non-casing Hebrew letter).
+        ok("\xFF\x{5D0}" =~ /\xFF\x{5D0}/i, "Y WITH DIAERESIS l =~ l");
+        ok("\xFF\x{5D0}" =~ /\x{178}\x{5D0}/i, "Y WITH DIAERESIS l =~ u");
+        ok("\x{178}\x{5D0}" =~ /\xFF\x{5D0}/i, "Y WITH DIAERESIS u =~ l");
+        ok("\x{178}\x{5D0}" =~ /\x{178}\x{5D0}/i, "Y WITH DIAERESIS u =~ u");
+
+        # U+0100.
+        ok("\x{100}" =~ /\x{100}/i, "A WITH MACRON u =~ u");
+        ok("\x{100}" =~ /\x{101}/i, "A WITH MACRON u =~ l");
+        ok("\x{101}" =~ /\x{100}/i, "A WITH MACRON l =~ u");
+        ok("\x{101}" =~ /\x{101}/i, "A WITH MACRON l =~ l");
+    }
 
 } # End of sub run_tests
 
-- 
1.8.5.2 (Apple Git-48)

@p5pRT
Copy link
Author

p5pRT commented May 7, 2014

From @jhi

On Tuesday-201404-22, 20​:28, perlbug-followup@​perl.org wrote​:

Greetings,

This message has been automatically generated in response to the
creation of a perl bug report regarding​:
"[PATCH] Coverity​: regexec.c etc​: off-by-one error".

There is no need to reply to this message right now. Your ticket has been
assigned an ID of [perl #121708].

You can view your ticket at
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121708

Better summary line, patch attached.

@p5pRT
Copy link
Author

p5pRT commented May 7, 2014

From @jhi

0001-Off-by-one-in-PL_fold_locale-use.patch
From 25977d2f914c0f17a4131f85c5cf8c024ba37452 Mon Sep 17 00:00:00 2001
From: Jarkko Hietaniemi <jhi@iki.fi>
Date: Wed, 7 May 2014 09:19:00 -0400
Subject: [PATCH] Off-by-one in PL_fold_locale use.

Fix for Coverity perl5 CID 29033: Out-of-bounds read
 (OVERRUN) overrun-local: Overrunning array PL_fold_locale of 256 bytes at
 byte offset 256 using index c1 (which evaluates to 256).

- the "c1 > 256" was off-by-one, it needed to be "c1 > 255",
  it could have caused the PL_fold_locale to be accessed one past the end,
  at offset 256, but we have dodged the bullet thanks to the regex engine
  optimizing the bad case away before we hit it (analysis by Karl Williamson):
  regexec.c
- comment fixes (pointed out by Karl Williamson): regexec.c
- add tests to nail down the behaviour of fold matching
  for the last of Latin-1 (0xFF, lowercase which curiously does not have
  uppercase within Latin-1). and the first pure Unicode: t/re/pat.t
---
 regexec.c  |  8 ++++----
 t/re/pat.t | 26 +++++++++++++++++++++++++-
 2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/regexec.c b/regexec.c
index 4ed2ba9..6b31aad 100644
--- a/regexec.c
+++ b/regexec.c
@@ -3693,7 +3693,7 @@ S_setup_EXACTISH_ST_c1_c2(pTHX_ const regnode * const text_node, int *c1p,
         }
         else { /* an EXACTFish node which doesn't begin with a multi-char fold */
             c1 = is_utf8_pat ? valid_utf8_to_uvchr(pat, NULL) : *pat;
-            if (c1 > 256) {
+            if (c1 > 255) {
                 /* Load the folds hash, if not already done */
                 SV** listp;
                 if (! PL_utf8_foldclosures) {
@@ -3746,10 +3746,10 @@ S_setup_EXACTISH_ST_c1_c2(pTHX_ const regnode * const text_node, int *c1p,
                         /* Folds that cross the 255/256 boundary are forbidden
                          * if EXACTFL (and isnt a UTF8 locale), or EXACTFA and
                          * one is ASCIII.  Since the pattern character is above
-                         * 256, and its only other match is below 256, the only
+                         * 255, and its only other match is below 256, the only
                          * legal match will be to itself.  We have thrown away
                          * the original, so have to compute which is the one
-                         * above 255 */
+                         * above 255. */
                         if ((c1 < 256) != (c2 < 256)) {
                             if ((OP(text_node) == EXACTFL
                                  && ! IN_UTF8_CTYPE_LOCALE)
@@ -3768,7 +3768,7 @@ S_setup_EXACTISH_ST_c1_c2(pTHX_ const regnode * const text_node, int *c1p,
                     }
                 }
             }
-            else /* Here, c1 is < 255 */
+            else /* Here, c1 is <= 255 */
                 if (utf8_target
                     && HAS_NONLATIN1_FOLD_CLOSURE(c1)
                     && ( ! (OP(text_node) == EXACTFL && ! IN_UTF8_CTYPE_LOCALE))
diff --git a/t/re/pat.t b/t/re/pat.t
index 04f8b84..81cb64b 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -20,7 +20,7 @@ BEGIN {
     require './test.pl';
 }
 
-plan tests => 721;  # Update this when adding/deleting tests.
+plan tests => 733;  # Update this when adding/deleting tests.
 
 run_tests() unless caller;
 
@@ -1580,7 +1580,31 @@ EOP
         like "\x{AA}", qr/a?[\W_]/d, "\\W with /d synthetic start class works";
     }
 
+    {
+        # Verify that the very last Latin-1 U+00FF
+        # (LATIN SMALL LETTER Y WITH DIAERESIS)
+        # and its UPPER counterpart (U+0178 which is pure Unicode),
+        # and likewise for the very first pure Unicode
+        # (LATIN CAPITAL LETTER A WITH MACRON) fold-match properly,
+        # and there are no off-by-one logic errors in the transition zone.
+
+        ok("\xFF" =~ /\xFF/i, "Y WITH DIAERESIS l =~ l");
+        ok("\xFF" =~ /\x{178}/i, "Y WITH DIAERESIS l =~ u");
+        ok("\x{178}" =~ /\xFF/i, "Y WITH DIAERESIS u =~ l");
+        ok("\x{178}" =~ /\x{178}/i, "Y WITH DIAERESIS u =~ u");
 
+        # U+00FF with U+05D0 (non-casing Hebrew letter).
+        ok("\xFF\x{5D0}" =~ /\xFF\x{5D0}/i, "Y WITH DIAERESIS l =~ l");
+        ok("\xFF\x{5D0}" =~ /\x{178}\x{5D0}/i, "Y WITH DIAERESIS l =~ u");
+        ok("\x{178}\x{5D0}" =~ /\xFF\x{5D0}/i, "Y WITH DIAERESIS u =~ l");
+        ok("\x{178}\x{5D0}" =~ /\x{178}\x{5D0}/i, "Y WITH DIAERESIS u =~ u");
+
+        # U+0100.
+        ok("\x{100}" =~ /\x{100}/i, "A WITH MACRON u =~ u");
+        ok("\x{100}" =~ /\x{101}/i, "A WITH MACRON u =~ l");
+        ok("\x{101}" =~ /\x{100}/i, "A WITH MACRON l =~ u");
+        ok("\x{101}" =~ /\x{101}/i, "A WITH MACRON l =~ l");
+    }
 
 } # End of sub run_tests
 
-- 
1.9.2

@p5pRT
Copy link
Author

p5pRT commented May 28, 2014

From @tsee

Thanks, applied as e8ea835.

@p5pRT
Copy link
Author

p5pRT commented May 28, 2014

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented May 28, 2014

@tsee - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant