Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

assert in /\p^ / #14544

Closed
p5pRT opened this issue Feb 27, 2015 · 9 comments
Closed

assert in /\p^ / #14544

p5pRT opened this issue Feb 27, 2015 · 9 comments

Comments

@p5pRT
Copy link

p5pRT commented Feb 27, 2015

Migrated from rt.perl.org#123946 (status was 'resolved')

Searchable as RT123946$

@p5pRT
Copy link
Author

p5pRT commented Feb 27, 2015

From @hvds

AFL (<http​://lcamtuf.coredump.cx/afl/>) finds this​:

% ./perl -Ilib -ce '/\p^ /'
Use of uninitialized value $table in concatenation (.) or string at lib/utf8_heavy.pl line 398.
Use of uninitialized value $_[0] in substitution (s///) at lib/utf8_heavy.pl line 23.
Use of uninitialized value $loose in pattern match (m//) at lib/utf8_heavy.pl line 25.
Use of uninitialized value $table in concatenation (.) or string at lib/utf8_heavy.pl line 409.
perl​: sv.c​:11436​: Perl_sv_vcatpvfn_flags​: Assertion `(IV)elen >= 0' failed.
Aborted (core dumped)
%

(I haven't yet looked at the utf8_heavy warnings, only the coredump.)

I'm not sure if (for example) /\p^L/ is intended to be supported, but the length handling in regcomp.c is suspect for this case​: after \p if we see '{', we set the length (UV n) to be the number of characters to the matching close brace, else we set it to 1 (regcomp.c​:14177). If we then see '^' we set a flag and decrement n, and skip past additional whitespace further decrementing n as we go. If we then get an error, we can end up passing a negative (wrapped to large unsigned) n as the length.

My guess is we want to support /\p^L/ but not /\p^ L/; the diff below is a start towards that, but it's not sufficient - I think we need to move the parsing out of the !SIZE_ONLY guard, or we can't be sure to continue at the right point.

Hugo
--- a/regcomp.c
+++ b/regcomp.c
@​@​ -14122,6 +14122,7 @​@​ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp,
  case 'P'​:
  {
  char *e;
+ int braced = 1;

  /* We will handle any undefined properties ourselves */
  U8 swash_init_flags = _CORE_SWASH_INIT_RETURN_IF_UNDEF
@​@​ -14149,6 +14150,7 @​@​ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp,
  else {
  e = RExC_parse;
  n = 1;
+ braced = 0;
  }
  if (!SIZE_ONLY) {
  SV* invlist;
@​@​ -14156,17 +14158,19 @​@​ S_regclass(pTHX_ RExC_state_t *pRExC_state, I32 *flagp

  if (UCHARAT(RExC_parse) == '^') {
  RExC_parse++;
- n--;
+ if (braced) n--;
  /* toggle. (The rhs xor gets the single bit that
  * differs between P and p; the other xor inverts just
  * that bit) */
  value ^= 'P' ^ 'p';

+ if (braced) {
  while (isSPACE(*RExC_parse)) {
  RExC_parse++;
  n--;
  }
  }
+ }
  /* Try to get the definition of the property into
  * <invlist>. If /i is in effect, the effective property
  * will have its name be <__NAME_i>. The design is

@p5pRT
Copy link
Author

p5pRT commented Feb 27, 2015

From @hvds

On Thu Feb 26 17​:52​:20 2015, hv wrote​:

AFL (<http​://lcamtuf.coredump.cx/afl/>) finds this​:

% ./perl -Ilib -ce '/\p^ /'
Use of uninitialized value $table in concatenation (.) or string at
lib/utf8_heavy.pl line 398.
Use of uninitialized value $_[0] in substitution (s///) at
lib/utf8_heavy.pl line 23.
Use of uninitialized value $loose in pattern match (m//) at
lib/utf8_heavy.pl line 25.
Use of uninitialized value $table in concatenation (.) or string at
lib/utf8_heavy.pl line 409.
perl​: sv.c​:11436​: Perl_sv_vcatpvfn_flags​: Assertion `(IV)elen >= 0'
failed.
Aborted (core dumped)
%

(I haven't yet looked at the utf8_heavy warnings, only the coredump.)

I now looked at the warnings​: in SWASHNEW, we check $type is non-empty (in fact TRUE) before stripping whitespace; if it consists only of whitespace, we later end up with undef $property and $table. Either of the following one-line insertions seem to be enough to avoid the problem, but I'm not sure if either is actually the right thing - Karl, do you have an opinion on this?

In the above case, the initial $type is " \n".

Hugo

--- a/lib/utf8_heavy.pl
+++ b/lib/utf8_heavy.pl
@​@​ -119,6 +119,7 @​@​ sub _loose_name ($) {

  $type =~ s/^\s+//;
  $type =~ s/\s+$//;
+return $type unless length $type;

  # regcomp.c surrounds the property name with '__" and '_i' if this
  # is to be caseless matching.

--- a/lib/utf8_heavy.pl
+++ b/lib/utf8_heavy.pl
@​@​ -194,6 +194,7 @​@​ sub _loose_name ($) {
  # value indicates the table we should use.
  my ($property, $table, @​remainder) =
  split /\s*[​:=]\s*/, $property_and_table, -1
+$property //= "";
  if (@​remainder) {
  pop @​recursed if @​recursed;
  return $type;
(END)

@p5pRT
Copy link
Author

p5pRT commented Feb 27, 2015

From @hvds

On Thu Feb 26 17​:52​:20 2015, hv wrote​:

My guess is we want to support /\p^L/ but not /\p^ L/; the diff below
is a start towards that, but it's not sufficient - I think we need to
move the parsing out of the !SIZE_ONLY guard, or we can't be sure to
continue at the right point.

I'm not at all sure about that, and the docs are coy - the only mention I can find of using a caret to invert a property is in perlunicode​:
  You can also use negation in both "\p{}" and "\P{}" by introducing a
  caret (^) between the first brace and the property name​: "\p{^Tamil}"
  is equal to "\P{Tamil}".

A CPAN grep shows braceless \p^x being tested by ShiftJIS​::Regexp, and documented in passing by its pod examples, but didn't show any other uses.

Permitting it, but not skipping whitespace after the caret, results in behaviour I don't understand - I think it's successfully matching some kind of null property, so that /\p^ / ends up roughly equivalent to /./s. So maybe it is better to skip whitespace after all; on that assumption I've pushed the branch hv/braceless-property for review, with one commit for the utf8_heavy warnings and a second for the parse issues.

Hugo

@p5pRT
Copy link
Author

p5pRT commented Feb 27, 2015

From @khwilliamson

On 02/27/2015 06​:48 AM, Hugo van der Sanden via RT wrote​:

On Thu Feb 26 17​:52​:20 2015, hv wrote​:

My guess is we want to support /\p^L/ but not /\p^ L/; the diff below
is a start towards that, but it's not sufficient - I think we need to
move the parsing out of the !SIZE_ONLY guard, or we can't be sure to
continue at the right point.

I'm not at all sure about that, and the docs are coy - the only mention I can find of using a caret to invert a property is in perlunicode​:
You can also use negation in both "\p{}" and "\P{}" by introducing a
caret (^) between the first brace and the property name​: "\p{^Tamil}"
is equal to "\P{Tamil}".

A CPAN grep shows braceless \p^x being tested by ShiftJIS​::Regexp, and documented in passing by its pod examples, but didn't show any other uses.

Permitting it, but not skipping whitespace after the caret, results in behaviour I don't understand - I think it's successfully matching some kind of null property, so that /\p^ / ends up roughly equivalent to /./s. So maybe it is better to skip whitespace after all; on that assumption I've pushed the branch hv/braceless-property for review, with one commit for the utf8_heavy warnings and a second for the parse issues.

Hugo

---
via perlbug​: queue​: perl5 status​: new
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=123946

I had noticed lately various issues with \p parsing, but thought it too
late to change until 5.23, but I'll mention some here.

First, white space is supposed to be significant in patterns except
under /x. However, within the braces of \p{}, it is ignored regardless
of /x, because Unicode suggests/requires doing that in resolving
property names. Thus /\p ^ L/ should be an error except possibly under
/x. It's a bug that it isn't. And I think the first thing to do is to
make it so. I rather think it should be forbidden here except within
braces, but maybe backcompat says we have to allow it. But definitely
not unless /x is specified.

Also, the white space accepted includes vertical space. I don't think
that was intended; I don't think we should be accepting

\p

^

L

with or without braces. I think those isSPACE calls should be isBLANK
calls instead. But if we do want to continue allowing vertical space,
we should be using Unicode's pattern white space instead of regular SPACE.

Another issue is probably endemic through perl, and that is the use of
strchr() to find matching characters, in this case the right brace.
Consider

/\p{foo # This was supposed to be a comment
  # and this
  # and so on
  # including this which contains a }
  # ...
/x

This actually compiles, and when you match against it, you get
Can't find Unicode property definition "foo # This was supposed to be a
comment"

It thought this was a user-defined property (with a multi-line name,
though the error message only prints the first line of that name). It
may be hard for the user to associate the error message with the actual
error if the pattern is buried inside a module. It should have been a
compile-time error.

More reasonable things are even problematic​: \p{Any=Y} might make sense
if Any were a Unicode property, all of which are specifiable in a
bipartite manner like this. But 'Any' is a Perl extension, none of
which can currently be specified this way. The code thinks it must be a
user-defined property whose name is "Any=Y", and compiles to that
interpretation, and the error found only upon execution. The code
should be fixed to look for valid syntax in user-defined property names.

strchr() leads to this kind of problem which exists in other constructs
besides this one.

To go back to your original question. We need to decide exactly what
should be legal first. I gave my opinion; others welcome (as long as
they agree with mine, that is).

@p5pRT
Copy link
Author

p5pRT commented Feb 27, 2015

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Nov 17, 2015

From @khwilliamson

The main issue was fixed by
163a633

--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Nov 17, 2015

@khwilliamson - Status changed from 'open' to 'pending release'

@p5pRT
Copy link
Author

p5pRT commented May 13, 2016

From @khwilliamson

Thank you for submitting this report. You have helped make Perl better.
 
With the release of Perl 5.24.0 on May 9, 2016, this and 149 other issues have been resolved.

Perl 5.24.0 may be downloaded via https://metacpan.org/release/RJBS/perl-5.24.0

@p5pRT
Copy link
Author

p5pRT commented May 13, 2016

@khwilliamson - Status changed from 'pending release' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant