\C yields check_locale_boundary_crossing assertion failure #14520

Closed
p5pRT opened this issue Feb 17, 2015 · 11 comments

Comments

@p5pRT

p5pRT commented Feb 17, 2015

Migrated from rt.perl.org#123861 (status was 'resolved')

Searchable as RT123861$

@p5pRT

p5pRT commented Feb 17, 2015

From @hvds

AFL (<http://lcamtuf.coredump.cx/afl/>) finds this:

% ./perl -Ilib -e '"\700" =~ /\C0/il'
\C is deprecated in regex; marked by <-- HERE in m/\C <-- HERE 0/ at -e line 1.
perl: utf8.c:1884: S_check_locale_boundary_crossing: Assertion `((U8)(*p) >= 0xc4)' failed.
Aborted (core dumped)
%

I guess this is the sort of reason why \C was deprecated in the first place, but I wonder if we could be doing a better job to detect this and either disallow it earlier or in some other way survive better.
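To illustrate why the one-liner ends up in the middle of a character: "\700" is code point U+01C0, which is stored as the two UTF-8 bytes 0xC7 0x80, and \C consumes only the first of them. The following is a minimal sketch, not perl source; `encode_utf8_small` and `is_continuation` are hypothetical helper names introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of perl's API): encode a code point
 * below U+0800 as UTF-8 and return the number of bytes written. */
static size_t encode_utf8_small(uint32_t cp, unsigned char out[2])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (cp >> 6));   /* lead byte 110xxxxx */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F)); /* continuation 10xxxxxx */
    return 2;
}

/* True if the byte is a UTF-8 continuation byte (10xxxxxx). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}
```

Encoding octal 0700 with this helper gives the bytes 0xC7 and 0x80. After a single-byte step past 0xC7, the next byte to be compared against the literal `0` is 0x80, which `is_continuation` reports as a continuation byte rather than the start of a character.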

Hugo

@p5pRT

p5pRT commented Feb 18, 2015

From @hvds

Here's a variant that triggers a different assert:

% ./perl -Ilib -e '"0\7000"=~m{\C+?0}'
\C is deprecated in regex; marked by <-- HERE in m/\C <-- HERE +?0/ at -e line 1.
perl: regexec.c:6606: S_regmatch: Assertion `n == (32767) || locinput == li' failed.
Aborted (core dumped)
%

Hugo

@p5pRT

p5pRT commented Feb 22, 2015

From @cpansprout

I have a debugging 5.14.4 installed, and it doesn’t fail the assertion. When was this bug introduced?

--

Father Chrysostomos

@p5pRT

p5pRT commented Feb 22, 2015

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

p5pRT commented Feb 23, 2015

From @hvds

The first case (/\C0/il) bisects to quite recently:

commit 1d39b2c
Author: Karl Williamson <khw@cpan.org>
Date: Fri Dec 26 18:31:04 2014 -0700

  Simplify foldEQ_utf8
 
  This moves the uncommon case of handling inputs under non-UTF-8 locales
  out of this function to the functions it calls, which already have the
  logic to handle it. This simplifies this function, cutting a couple
  branches each time through the loop from the common usage.
 
  The locale handling is slowed down somewhat, but even if that were a
  concern, another simpler function is normally used for locale handling.
  This gets called only when one or both of the comparison strings is
  UTF-8, which should be comparatively rare for non-UTF8 locales.

The second case was in 5.18:

commit eb72505
Author: David Mitchell <davem@iabyn.com>
Date: Thu Sep 13 19:58:25 2012 +0100

  regmatch(): remove reginput from CURLY etc
 
  reginput mostly tracked locinput, except when regrepeat() was called.
  With a bit of jiggling, it could be eliminated for these blocks of code.
 
  This is part of a campaign to eliminate the reginput variable.

Hugo

@p5pRT

p5pRT commented Mar 1, 2015

From @khwilliamson

On 02/22/2015 06:59 PM, Hugo van der Sanden via RT wrote:

The first case (/\C0/il) bisects to quite recently:

commit 1d39b2c
Author: Karl Williamson <khw@cpan.org>
Date: Fri Dec 26 18:31:04 2014 -0700

  Simplify foldEQ_utf8

  This moves the uncommon case of handling inputs under non-UTF-8 locales
  out of this function to the functions it calls, which already have the
  logic to handle it. This simplifies this function, cutting a couple
  branches each time through the loop from the common usage.

  The locale handling is slowed down somewhat, but even if that were a
  concern, another simpler function is normally used for locale handling.
  This gets called only when one or both of the comparison strings is
  UTF-8, which should be comparatively rare for non-UTF8 locales.

This bisect doesn't really mean anything. Things were added that didn't
need this assert before.

I'm having trouble reproducing it. What locale is in effect?

The second case was in 5.18:

commit eb72505
Author: David Mitchell <davem@iabyn.com>
Date: Thu Sep 13 19:58:25 2012 +0100

  regmatch(): remove reginput from CURLY etc

  reginput mostly tracked locinput, except when regrepeat() was called.
  With a bit of jiggling, it could be eliminated for these blocks of code.

  This is part of a campaign to eliminate the reginput variable.

Hugo

---
via perlbug: queue: perl5 status: open
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=123861

I don't know what to do here. It would be a lot of work to adequately
support \C, a deprecated feature that we want to get rid of yesterday,
and no one should want to pour effort into. One thing that comes to
mind would be to set a flag during regex execution when \C matches a
partial UTF-8 character, and then add code that tests that flag, and
says to other code that you had better be extra careful. But this is a
lot of (wasted) work.
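As a rough illustration of the flag idea described above, and not anything that exists in regexec.c, the sketch below records whether a single-byte advance left the position inside a multi-byte character, so that character-aware code can test the flag before decoding. The struct and function names (`match_state`, `advance_one_byte`, `safe_to_decode_char`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical matcher state; perl's real regmatch state looks different. */
struct match_state {
    const unsigned char *pos;  /* current position in the target string */
    const unsigned char *end;  /* one past the last byte */
    int mid_char;              /* nonzero if pos is inside a multi-byte char */
};

static int byte_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Advance exactly one byte, as \C did, and record whether the new
 * position is a continuation byte. Returns 0 at end of input. */
static int advance_one_byte(struct match_state *st)
{
    if (st->pos >= st->end)
        return 0;
    st->pos++;
    st->mid_char = (st->pos < st->end) && byte_is_continuation(*st->pos);
    return 1;
}

/* Code that assumes well-formed character starts would consult this first. */
static int safe_to_decode_char(const struct match_state *st)
{
    return !st->mid_char;
}
```

On the two bytes 0xC7 0x80, the first call to `advance_one_byte` sets `mid_char`, so `safe_to_decode_char` returns 0. Every caller that folds or decodes characters would need such a check, which is the cost the comment above refers to.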

@p5pRT

p5pRT commented Mar 1, 2015

From @hvds

On Sat Feb 28 20:43:06 2015, public@khwilliamson.com wrote:

This bisect doesn't really mean anything. Things were added that didn't need this assert before.

I'm having trouble reproducing it. What locale is in effect?

C locale:

% LC_ALL=C ./perl -Ilib -e '"\700" =~ /\C0/il'
\C is deprecated in regex; marked by <-- HERE in m/\C <-- HERE 0/ at -e line 1.
perl: utf8.c:1890: S_check_locale_boundary_crossing: Assertion `((U8)(*p) >= 0xc4)' failed.
Aborted (core dumped)
%

In utf8.c:Perl__to_utf8_fold_flags, it's assumed we're at the start of a well-formed character; in this case, however, we're calling it with p pointing at the second octet of [\xc7 \x80], so we fail the tests UTF8_IS_INVARIANT(*p) and UTF8_IS_DOWNGRADEABLE_START(*p) and fall through to:
  else { /* utf8, ord above 255 */
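The fall-through described above can be sketched as follows. The two predicates are simplified ASCII-platform approximations of what the perl macros test, written here as an assumption rather than copied from utf8.h, and `pick_branch` and `lead_byte_plausible` are hypothetical names. The `>= 0xc4` comparison is taken from the assertion text in the report.

```c
#include <assert.h>

enum fold_branch {
    BRANCH_INVARIANT,      /* byte stands for itself (below 0x80) */
    BRANCH_DOWNGRADEABLE,  /* two-byte char with ord 128..255 */
    BRANCH_ABOVE_255       /* the final "else" branch */
};

/* Assumed approximation of UTF8_IS_INVARIANT on ASCII platforms. */
static int is_invariant(unsigned char c)
{
    return c < 0x80;
}

/* Assumed approximation of UTF8_IS_DOWNGRADEABLE_START: lead byte
 * 0xC2 or 0xC3, i.e. a character that fits in one Latin-1 byte. */
static int is_downgradeable_start(unsigned char c)
{
    return (c & 0xFE) == 0xC2;
}

/* Mirror the if / else-if / else chain described in the comment above. */
static enum fold_branch pick_branch(unsigned char first)
{
    if (is_invariant(first))
        return BRANCH_INVARIANT;
    if (is_downgradeable_start(first))
        return BRANCH_DOWNGRADEABLE;
    return BRANCH_ABOVE_255;
}

/* The condition from the failed assertion, applied to the first byte. */
static int lead_byte_plausible(unsigned char first)
{
    return first >= 0xc4;
}
```

With the continuation byte 0x80 as input, `pick_branch` lands in `BRANCH_ABOVE_255` even though 0x80 is not a lead byte at all, and `lead_byte_plausible(0x80)` is false, which corresponds to the assertion that fires. Starting from the real lead byte 0xC7 selects the same branch but passes the check.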

Here's the full stack trace​:

perl: utf8.c:1890: S_check_locale_boundary_crossing: Assertion `((U8)(*p) >= 0xc4)' failed.

Program received signal SIGABRT, Aborted.
0x00007ffff70e9bb9 in __GI_raise (sig=sig@entry=6)
  at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) where
#0 0x00007ffff70e9bb9 in __GI_raise (sig=sig@entry=6)
  at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff70ecfc8 in __GI_abort () at abort.c:89
#2 0x00007ffff70e2a76 in __assert_fail_base (
  fmt=0x7ffff7234370 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
  assertion=assertion@entry=0x80c65e "((U8)(*p) >= 0xc4)",
  file=file@entry=0x80bf15 "utf8.c", line=line@entry=1890,
  function=function@entry=0x80dd20 <__PRETTY_FUNCTION__.14465> "S_check_locale_boundary_crossing") at assert.c:92
#3 0x00007ffff70e2b22 in __GI___assert_fail (
  assertion=0x80c65e "((U8)(*p) >= 0xc4)", file=0x80bf15 "utf8.c",
  line=1890,
  function=0x80dd20 <__PRETTY_FUNCTION__.14465> "S_check_locale_boundary_crossing") at assert.c:101
#4 0x00000000006d7499 in S_check_locale_boundary_crossing (p=0xa64751 "\200",
  result=128, ustrp=0x7fffffffd510 "\200\325\377\377\377\177",
  lenp=0x7fffffffd4b8) at utf8.c:1890
#5 0x00000000006d828f in Perl__to_utf8_fold_flags (p=0xa64751 "\200",
  ustrp=0x7fffffffd510 "\200\325\377\377\377\177", lenp=0x7fffffffd4b8,
  flags=3 '\003') at utf8.c:2229
#6 0x00000000006e159d in Perl_foldEQ_utf8_flags (s1=0xa646c8 "0", pe1=0x0,
  l1=1, u1=false, s2=0xa64751 "\200", pe2=0x7fffffffd668, l2=0, u2=true,
  flags=2) at utf8.c:4084
#7 0x00000000006c38a3 in S_regmatch (reginfo=0x7fffffffe1a0,
  startpos=0xa64750 "\307\200", prog=0xa646c0) at regexec.c:5473
#8 0x00000000006bc227 in S_regtry (reginfo=0x7fffffffe1a0,
  startposp=0x7fffffffdd48) at regexec.c:3492
#9 0x00000000006b1a3c in S_find_byclass (prog=0xa649d0, c=0xa646c0,
  s=0xa64750 "\307\200", strend=0xa64752 "", reginfo=0x7fffffffe1a0)
  at regexec.c:1809
#10 0x00000000006bb5a8 in Perl_regexec_flags (rx=0xa614f0,
  stringarg=0xa64750 "\307\200", strend=0xa64752 "",
  strbeg=0xa64750 "\307\200", minend=0, sv=0xa61430, data=0x0, flags=97)
  at regexec.c:3244
#11 0x00000000005988f1 in Perl_pp_match () at pp_hot.c:1486
#12 0x0000000000545c42 in Perl_runops_debug () at dump.c:2237
#13 0x0000000000460555 in S_run_body (oldscope=1) at perl.c:2427
#14 0x000000000045fb99 in perl_run (my_perl=0xa42010) at perl.c:2350
#15 0x000000000041eee5 in main (argc=4, argv=0x7fffffffe638,
  env=0x7fffffffe660) at perlmain.c:116
(gdb)

Hugo

@p5pRT

p5pRT commented Apr 8, 2016

From @khwilliamson

The entire \C construct, which this ticket is about, is being removed as of 5.24, so bugs with it are no longer relevant.
--
Karl Williamson

@p5pRT

p5pRT commented Apr 8, 2016

@khwilliamson - Status changed from 'open' to 'pending release'

@p5pRT

p5pRT commented May 13, 2016

From @khwilliamson

Thank you for submitting this report. You have helped make Perl better.
 
With the release of Perl 5.24.0 on May 9, 2016, this and 149 other issues have been resolved.

Perl 5.24.0 may be downloaded via https://metacpan.org/release/RJBS/perl-5.24.0

@p5pRT

p5pRT commented May 13, 2016

@khwilliamson - Status changed from 'pending release' to 'resolved'
