Some UTF-8 regular expression matches fail when read from file #15680

Closed
p5pRT opened this issue Oct 24, 2016 · 16 comments

Comments

p5pRT commented Oct 24, 2016

Migrated from rt.perl.org#129950 (status was 'resolved')

Searchable as RT129950$

p5pRT commented Oct 24, 2016

From @hiroshi-manabe

You can reproduce the bug with the following procedure:
1. perl -CO -e 'print "a\x{e4}";' > foo.txt # aä
2. perl -CI -e 'open IN, "<", "foo.txt"; $_ = <IN>; print m{^a|a\x{e4}$} ? "matched\n" : "not matched\n";'
Output: not matched

This happens only when the string is read from a file handle and the second character is in the range \x{80}-\x{ff}.
Curiously, the match succeeds if the regexp is m{^a|a[\x{e3}-\x{e4}]$} or m{^a|a[\x{e4}-\x{e5}]$}, but not if it is m{^a|a[\x{e4}-\x{e4}]$}.

p5pRT commented Oct 24, 2016

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Oct 24, 2016

From @hiroshi-manabe

Sorry, the bug only reproduces when the alternation is inside a set of parentheses, e.g. m{^(a|a\x{e4})$}.

p5pRT commented Oct 24, 2016

From @hiroshi-manabe

Sorry again, the correct Unicode option for step 2 is -Ci.

p5pRT commented Oct 24, 2016

From @dcollinsn

This seems interesting:

$ perl -Ci -e 'open IN, "<", "foo.txt"; $_ = <IN>; print m{^(a\x{e4})$} ? "matched\n" : "not matched\n";'
matched
$ perl -Ci -e 'open IN, "<", "foo.txt"; $_ = <IN>; print m{^(a|a\x{e4})$} ? "matched\n" : "not matched\n";'
not matched

And with -Dr...

dcollins@nightshade64:~/toolchain$ perl5.25.2-debug -D -Ci -e 'open IN, "<", "foo.txt"; $_ = <IN>; print m{^(a\x{e4})$} ? "matched\n" : "not matched\n";'
Debugging flag values: (see also -d)
  p Tokenizing and parsing (with v, displays parse stack)
  s Stack snapshots (with v, displays all stacks)
  l Context (loop) stack processing
  t Trace execution
  o Method and overloading resolution
  c String/numeric conversions
  P Print profiling info, source file input state
  m Memory and SV allocation
  f Format processing
  r Regular expression parsing and execution
  x Syntax tree dump
  u Tainting checks
  H Hash dump -- usurps values()
  X Scratchpad allocation
  D Cleaning up
  S Op slab allocation
  T Tokenising
  R Include reference counts of dumped variables (eg when using -Ds)
  J Do not s,t,P-debug (Jump over) opcodes within package DB
  v Verbose: use in conjunction with other flags
  C Copy On Write
  A Consistency checks on internal structures
  q quiet - currently only suppresses the 'EXECUTING' message
  M trace smart match resolution
  B dump suBroutine definitions, including special Blocks like BEGIN
  L trace some locale setting information--for Perl core development
  i trace PerlIO layer processing

EXECUTING...

matched
dcollins@nightshade64:~/toolchain$ perl5.25.2-debug -Dr -Ci -e 'open IN, "<", "foo.txt"; $_ = <IN>; print m{^(a\x{e4})$} ? "matched\n" : "not matched\n";'
Compiling REx "^(a\x{e4})$"
rarest char ▒ at 1
Final program:
  1: SBOL /^/ (2)
  2: OPEN1 (4)
  4: EXACT <a\x{e4}> (6)
  6: CLOSE1 (8)
  8: SEOL (9)
  9: END (0)
anchored "a%x{e4}"$ at 0 (checking anchored noscan) anchored(SBOL) minlen 2
Enabling $` $& $' support (0x7).

EXECUTING...

Matching REx "^(a\x{e4})$" against "a%x{e4}"
UTF-8 string...
Intuit: trying to determine minimum start position...
rarest char ▒ at 2
  Looking for check substr at fixed offset 0...
Intuit: Successfully guessed: match at offset 0
  0 <> <a%x{e4}> | 0| 1:SBOL /^/(2)
  0 <> <a%x{e4}> | 0| 2:OPEN1(4)
  0 <> <a%x{e4}> | 0| 4:EXACT <a\x{e4}>(6)
  3 <a%x{e4}> <> | 0| 6:CLOSE1(8)
  3 <a%x{e4}> <> | 0| 8:SEOL(9)
  3 <a%x{e4}> <> | 0| 9:END(0)
Match successful!
matched
Freeing REx: "^(a\x{e4})$"
dcollins@nightshade64:~/toolchain$ perl5.25.2-debug -Dr -Ci -e 'open IN, "<", "foo.txt"; $_ = <IN>; print m{^(a|a\x{e4})$} ? "matched\n" : "not matched\n";'
Compiling REx "^(a|a\x{e4})$"
rarest char
at 0
rarest char a at 0
Final program:
  1: SBOL /^/ (2)
  2: OPEN1 (4)
  4: EXACT <a> (6)
  6: TRIE-EXACT[\xE4] (10)
  <>
  <\344>
  10: CLOSE1 (12)
  12: SEOL (13)
  13: END (0)
anchored "a" at 0 floating ""$ at 1..2 (checking anchored noscan) anchored(SBOL) minlen 1
Enabling $` $& $' support (0x7).

EXECUTING...

Matching REx "^(a|a\x{e4})$" against "a%x{e4}"
UTF-8 string...
Intuit: trying to determine minimum start position...
rarest char
at 0
rarest char a at 0
  Looking for check substr at fixed offset 0...
Intuit: Successfully guessed: match at offset 0
  0 <> <a%x{e4}> | 0| 1:SBOL /^/(2)
  0 <> <a%x{e4}> | 0| 2:OPEN1(4)
  0 <> <a%x{e4}> | 0| 4:EXACT <a>(6)
  1 <a> <%x{e4}> | 0| 6:TRIE-EXACT[\xE4](10)
  | 0| matched empty string...
  1 <a> <%x{e4}> | 0| 10:CLOSE1(12)
  1 <a> <%x{e4}> | 0| 12:SEOL(13)
  | 0| failed...
Match failed
not matched
Freeing REx: "^(a|a\x{e4})$"

Unicode errors aside, is the TRIE optimization getting this wrong?

--
Respectfully,
Dan Collins

p5pRT commented Oct 24, 2016

From @tonycoz

The string doesn't need to be from a file:

$ ./perl -e '$_ = "a\xE4"; utf8::upgrade($_); print m{^(a|a\x{e4})$} ? "matched\n" : "not matched\n";'
not matched

(blead perl)

The match is failing around line 5611 of regexec.c:

            if (   trie->bitmap
                && (NEXTCHR_IS_EOS || !TRIE_BITMAP_TEST(trie, nextchr)))
            {
                if (trie->states[ state ].wordnum) {
                     DEBUG_EXECUTE_r(
                        Perl_re_exec_indentf( aTHX_  "%smatched empty string...%s\n",
                                      depth, PL_colors[4], PL_colors[5])
                    );

At this point nextchr has the first byte of the UTF-8 encoded \xE4 (0xc3).
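
As an illustration of where 0xc3 comes from: code points in the range U+0080 to U+07FF take two bytes in UTF-8, so the upgraded string holds 0xC3 0xA4 where the pattern literal holds the single byte 0xE4. A minimal sketch in C follows; `sketch_utf8_encode` is a hypothetical helper written for this note, not a function from the Perl core, and it only handles code points below U+0800.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Perl core: encode a code point
 * below U+0800 as UTF-8. Writes 1 or 2 bytes to out and returns the count. */
static size_t sketch_utf8_encode(unsigned cp, unsigned char out[2])
{
    assert(cp < 0x800);                 /* only 1- and 2-byte forms are handled */
    if (cp < 0x80) {                    /* ASCII: same byte in both encodings */
        out[0] = (unsigned char)cp;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (cp >> 6));    /* lead byte: 110xxxxx */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));  /* continuation: 10xxxxxx */
    return 2;
}
```

Calling `sketch_utf8_encode(0xE4, buf)` returns 2 and fills `buf` with 0xC3 0xA4; the first of those bytes is the value nextchr holds at this point.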

Tony

p5pRT commented Oct 25, 2016

From @iabyn

I'm looking into this as we speak.

--
I thought I was wrong once, but I was mistaken.

p5pRT commented Oct 25, 2016

From @demerphq

I was going to look into it later as well. Let me know how far you get.

We used to preload the bitmap with the first byte of the unicode
representation of the string, but I guess I can leave it to you.

Let me know otherwise.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented Oct 25, 2016

From @iabyn

Not far as it turns out. The failing code has TRIE_BITMAP_TEST() returning
false, while good code like

  $_ = "a\x64";
  print "match\n" if m{^(a|a\x{64})$};

doesn't.
At which point I got distracted and haven't looked further. You're probably
a better choice than me to take this further :-)
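
The difference between the two cases can be sketched with a plain 256-bit bitmap. This is an illustration only: `sketch_bitmap_set`, `sketch_bitmap_test` and `sketch_first_octet_ok` are hypothetical stand-ins, not the core's TRIE_BITMAP macros, and the sketch assumes the bitmap was populated with the single non-UTF-8 byte only, which is consistent with the TRIE-EXACT[\xE4] line in the -Dr output earlier in the thread.

```c
/* One bit per possible first octet (0..255). */
typedef struct {
    unsigned char bits[32];
} sketch_bitmap;

/* Record that a match may start with this octet. */
static void sketch_bitmap_set(sketch_bitmap *bm, unsigned char octet)
{
    bm->bits[octet >> 3] |= (unsigned char)(1u << (octet & 7));
}

/* Return 1 if the octet was recorded, 0 otherwise. */
static int sketch_bitmap_test(const sketch_bitmap *bm, unsigned char octet)
{
    return (bm->bits[octet >> 3] >> (octet & 7)) & 1;
}

/* Build a bitmap holding only `registered`, then ask whether a target
 * string whose next byte is `first_octet` passes the quick check. */
static int sketch_first_octet_ok(unsigned char registered,
                                 unsigned char first_octet)
{
    sketch_bitmap bm = {{0}};
    sketch_bitmap_set(&bm, registered);
    return sketch_bitmap_test(&bm, first_octet);
}
```

With these definitions, `sketch_first_octet_ok(0x64, 0x64)` is 1 because \x64 is the same byte whether or not the string is UTF-8, while `sketch_first_octet_ok(0xE4, 0xC3)` is 0: the upgraded string presents the lead byte 0xC3, which was never registered.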

--
The Enterprise's efficient long-range scanners detect a temporal vortex
distortion in good time, allowing it to be safely avoided via a minor
course correction.
  -- Things That Never Happen in "Star Trek" #21

p5pRT commented Oct 27, 2016

From @demerphq

Fixed. This ticket can be closed.

commit da42332
Author: Yves Orton <demerphq@gmail.com>
Date: Thu Oct 27 13:52:24 2016 +0200

  regcomp.c: fix perl #129950 - fix firstchar bitmap under utf8 with
  prefix optimisation

  The trie code contains a number of sub optimisations, one of which
  extracts common prefixes from alternations, and another which is a
  bitmap of the possible matching first chars.

  The bitmap needs to contain the possible first octets of the string
  which the trie can match, and for codepoints which might have a different
  first octet under utf8 or non-utf8 we need to register BOTH octets.

  So for instance in the pattern (?:a|a\x{E4}) we should restructure this
  as a(|\x{E4}), and the bitmap for the trie should contain both \x{E4} AND
  \x{C3} as \x{C3} is the first byte of \x{E4} expressed as utf8.
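
The rule described in the commit message can be sketched as follows. This is an illustration under assumptions, not the code from commit da42332: the names `sketch_mark`, `sketch_marked` and `sketch_register_first_char` are hypothetical, and the sketch only handles code points below 0x100.

```c
#define SKETCH_BITMAP_BYTES 32   /* 256 bits, one per possible first octet */

/* Mark one octet as a possible first byte of a match. */
static void sketch_mark(unsigned char *bm, unsigned char octet)
{
    bm[octet >> 3] |= (unsigned char)(1u << (octet & 7));
}

/* Return 1 if the octet has been marked, 0 otherwise. */
static int sketch_marked(const unsigned char *bm, unsigned char octet)
{
    return (bm[octet >> 3] >> (octet & 7)) & 1;
}

/* Register a code point below 0x100 as a possible first character.
 * Both representations are recorded: the single native byte, and the
 * UTF-8 lead byte when the two differ (code points 0x80..0xFF). */
static void sketch_register_first_char(unsigned char *bm, unsigned char cp)
{
    sketch_mark(bm, cp);                                     /* non-utf8 target */
    if (cp >= 0x80)
        sketch_mark(bm, (unsigned char)(0xC0 | (cp >> 6)));  /* utf8 target */
}
```

After `sketch_register_first_char(bm, 0xE4)` on a zeroed bitmap, both 0xE4 and 0xC3 are marked, so the quick check passes whichever way the target string is encoded. In this sketch the bitmap acts only as a quick reject, so marking an extra octet costs some precision but cannot by itself produce a wrong match.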

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented Oct 28, 2016

@iabyn - Status changed from 'open' to 'pending release'

p5pRT commented May 30, 2017

From @khwilliamson

Thank you for filing this report. You have helped make Perl better.

With the release today of Perl 5.26.0, this and 210 other issues have been
resolved.

Perl 5.26.0 may be downloaded via:
https://metacpan.org/release/XSAWYERX/perl-5.26.0

If you find that the problem persists, feel free to reopen this ticket.

p5pRT commented May 30, 2017

@khwilliamson - Status changed from 'pending release' to 'resolved'
