regexp engine reads 1 beyond the string #10230

Closed
p5pRT opened this issue Mar 12, 2010 · 7 comments

Comments

p5pRT commented Mar 12, 2010

Migrated from rt.perl.org#73542 (status was 'resolved')

Searchable as RT73542$

p5pRT commented Mar 12, 2010

From @nwc10

Created by @nwc10

The regexp engine often reads 1 character beyond the end of the string,
before deciding that it doesn't need the value. If the byte 1 beyond the
end of the string doesn't exist, then this will SEGV.

This can be seen as a bug, or can be seen as wishlist. It's also old, and
probably dates from 5.000, if not 1.000. I know that Adrian Enache hit this,
but I don't know if there is an open bug report.

$ valgrind /home/nick/Sandpit/snap5.9.x-v5.11.5-59-g801ed99/bin/perl5.11.5 -MFile::Map -e 'File::Map::map_anonymous($a, 4096); $a =~ /\0+/'
==2541== Memcheck, a memory error detector.
==2541== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==2541== Using LibVEX rev 1854, a library for dynamic binary translation.
==2541== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==2541== Using valgrind-3.3.1-Debian, a dynamic binary instrumentation framework.
==2541== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==2541== For more details, rerun with: -v
==2541==
==2541== Invalid read of size 1
==2541== at 0x69D6DC: S_regmatch (regexec.c:5417)
==2541== by 0x68F284: S_regtry (regexec.c:2474)
==2541== by 0x68C67A: Perl_regexec_flags (regexec.c:2075)
==2541== by 0x54845C: Perl_pp_match (pp_hot.c:1362)
==2541== by 0x4FF223: Perl_runops_debug (dump.c:2049)
==2541== by 0x44B1FF: S_run_body (perl.c:2308)
==2541== by 0x44A6C9: perl_run (perl.c:2233)
==2541== by 0x42007C: main (perlmain.c:117)
==2541== Address 0x4020000 is not stack'd, malloc'd or (recently) free'd
==2541==
==2541== Process terminating with default action of signal 11 (SIGSEGV)
==2541== Access not within mapped region at address 0x4020000
==2541== at 0x69D6DC: S_regmatch (regexec.c:5417)
==2541== by 0x68F284: S_regtry (regexec.c:2474)
==2541== by 0x68C67A: Perl_regexec_flags (regexec.c:2075)
==2541== by 0x54845C: Perl_pp_match (pp_hot.c:1362)
==2541== by 0x4FF223: Perl_runops_debug (dump.c:2049)
==2541== by 0x44B1FF: S_run_body (perl.c:2308)
==2541== by 0x44A6C9: perl_run (perl.c:2233)
==2541== by 0x42007C: main (perlmain.c:117)
==2541==
==2541== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 17 from 2)
==2541== malloc/free: in use at exit: 734,151 bytes in 8,911 blocks.
==2541== malloc/free: 16,092 allocs, 7,181 frees, 1,283,041 bytes allocated.
==2541== For counts of detected errors, rerun with: -v
==2541== searching for pointers to 8,911 not-freed blocks.
==2541== checked 1,027,248 bytes.
==2541==
==2541== LEAK SUMMARY:
==2541== definitely lost: 2,199 bytes in 36 blocks.
==2541== possibly lost: 0 bytes in 0 blocks.
==2541== still reachable: 731,952 bytes in 8,875 blocks.
==2541== suppressed: 0 bytes in 0 blocks.
==2541== Rerun with --leak-check=full to see details of leaked memory.
Segmentation fault
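
For anyone wanting to reproduce the layout without File::Map, here is a minimal C sketch of a buffer whose last byte sits directly before an inaccessible page. It assumes a POSIX system that provides `mmap`, `mprotect`, and `MAP_ANONYMOUS`. The name `map_page_with_guard` is hypothetical, and this is not necessarily how File::Map sets up its mapping.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: return a zero-filled buffer of exactly one page whose
 * last byte is immediately followed by a page with no access rights.
 * Returns NULL on failure; *len_out receives the usable length. */
static unsigned char *map_page_with_guard(size_t *len_out)
{
    assert(len_out != NULL);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return NULL;

    /* Reserve two adjacent pages; anonymous mappings start out zero-filled. */
    unsigned char *base = mmap(NULL, 2 * (size_t)page, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Revoke all access to the second page. */
    if (mprotect(base + page, (size_t)page, PROT_NONE) != 0) {
        munmap(base, 2 * (size_t)page);
        return NULL;
    }

    *len_out = (size_t)page;
    return base;
}
```

With a buffer from this helper, reading `buf[len - 1]` is fine, while reading `buf[len]` should fault, which is the situation the regexp engine walks into above.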

It's arguably "wishlist" because, strictly, the scalar is not well formed
according to the rules of the internals: it doesn't have a '\0' byte beyond
the end. That trailing byte is usually what saves us. However, it still means
that we are reading more than we need, and hence causing cache misses, and
potentially even page faults.
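
To illustrate why the over-read normally goes unnoticed: a buffer that follows the trailing-NUL convention is allocated one byte larger than the string data, so the byte just past the end is always readable. A minimal sketch of that convention in plain C follows; `copy_with_trailing_nul` is a hypothetical helper for illustration, not a Perl API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy `cur` bytes and append one NUL, so the
 * allocation is cur + 1 bytes and buf[cur] is always a valid read. */
static unsigned char *copy_with_trailing_nul(const unsigned char *src, size_t cur)
{
    assert(src != NULL || cur == 0);

    unsigned char *buf = malloc(cur + 1);
    if (buf == NULL)
        return NULL;
    if (cur > 0)
        memcpy(buf, src, cur);
    buf[cur] = '\0';  /* the byte "beyond the end" that an over-read lands on */
    return buf;
}
```

A mapped buffer that ends exactly at a page boundary has no such spare byte, which is why the same read is harmless on ordinary scalars and fatal here.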

You can see the structure of the SVs that File::Map produces with

$ /home/nick/Sandpit/snap5.9.x-v5.11.5-59-g801ed99/bin/perl5.11.5 -MDevel::Peek -MFile::Map -e 'File::Map::map_anonymous($a, 16); Dump($a)'
SV = PVMG(0xa0b260) at 0x9cc1f8
  REFCNT = 1
  FLAGS = (SMG,RMG,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x2ae5a1e2b000 "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"
  CUR = 16
  LEN = 0
  MAGIC = 0x9cef20
  MG_VIRTUAL = 0x2ae5a1e2a480
  MG_PRIVATE = 19540
  MG_TYPE = PERL_MAGIC_uvar(U)
  MG_PTR = 0x9ced70 ""

and the "problem" again, as dump.c tries to access the byte beyond:

$ valgrind /home/nick/Sandpit/snap5.9.x-v5.11.5-59-g801ed99/bin/perl5.11.5 -MDevel::Peek -MFile::Map -e 'File::Map::map_anonymous($a, 4096); Dump($a)'
==2905== Memcheck, a memory error detector.
==2905== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==2905== Using LibVEX rev 1854, a library for dynamic binary translation.
==2905== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==2905== Using valgrind-3.3.1-Debian, a dynamic binary instrumentation framework.
==2905== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==2905== For more details, rerun with: -v
==2905==
SV = PVMG(0x5e52790) at 0x5c6b380
  REFCNT = 1
  FLAGS = (SMG,RMG,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x401f000 ==2905== Invalid read of size 1
==2905== at 0x4F12F4: Perl_pv_escape (dump.c:302)
==2905== by 0x4F169A: Perl_pv_pretty (dump.c:383)
==2905== by 0x4F1889: Perl_pv_display (dump.c:419)
==2905== by 0x4FA7A4: Perl_do_sv_dump (dump.c:1655)
==2905== by 0x60568D4: XS_Devel__Peek_Dump (Peek.xs:346)
==2905== by 0x557C72: Perl_pp_entersub (pp_hot.c:2882)
==2905== by 0x4FF223: Perl_runops_debug (dump.c:2049)
==2905== by 0x44B1FF: S_run_body (perl.c:2308)
==2905== by 0x44A6C9: perl_run (perl.c:2233)
==2905== by 0x42007C: main (perlmain.c:117)
==2905== Address 0x4020000 is not stack'd, malloc'd or (recently) free'd
==2905==
==2905== Process terminating with default action of signal 11 (SIGSEGV)
==2905== Access not within mapped region at address 0x4020000
==2905== at 0x4F12F4: Perl_pv_escape (dump.c:302)
==2905== by 0x4F169A: Perl_pv_pretty (dump.c:383)
==2905== by 0x4F1889: Perl_pv_display (dump.c:419)
==2905== by 0x4FA7A4: Perl_do_sv_dump (dump.c:1655)
==2905== by 0x60568D4: XS_Devel__Peek_Dump (Peek.xs:346)
==2905== by 0x557C72: Perl_pp_entersub (pp_hot.c:2882)
==2905== by 0x4FF223: Perl_runops_debug (dump.c:2049)
==2905== by 0x44B1FF: S_run_body (perl.c:2308)
==2905== by 0x44A6C9: perl_run (perl.c:2233)
==2905== by 0x42007C: main (perlmain.c:117)
==2905==
==2905== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 26 from 2)
==2905== malloc/free: in use at exit: 794,203 bytes in 9,496 blocks.
==2905== malloc/free: 18,210 allocs, 8,714 frees, 5,575,543 bytes allocated.
==2905== For counts of detected errors, rerun with: -v
==2905== searching for pointers to 9,496 not-freed blocks.
==2905== checked 1,089,288 bytes.
==2905==
==2905== LEAK SUMMARY:
==2905== definitely lost: 2,199 bytes in 36 blocks.
==2905== possibly lost: 0 bytes in 0 blocks.
==2905== still reachable: 792,004 bytes in 9,460 blocks.
==2905== suppressed: 0 bytes in 0 blocks.
==2905== Rerun with --leak-check=full to see details of leaked memory.
Segmentation fault

(sort of can't fix that one).

It would be good to change the regexp code in question, which currently
looks like this:

  /* Note that nextchr is a byte even in UTF */
  nextchr = UCHARAT(locinput);
  scan = prog;
  while (scan != NULL) {

The "quicker" fix looks to be to set nextchr to 0 if locinput >= PL_regeol.
The more elegant fix (which may not be possible) looks to be to defer reading
UCHARAT() until the later code knows that it needs it.
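
A minimal standalone sketch of the "quicker" fix, written as plain C rather than as a patch to regexec.c. The names `peek_or_zero`, `peek_or_eos`, and `PEEK_EOS` are hypothetical, and `eol` stands in for PL_regeol; this is an illustration of the idea, not the change that was eventually committed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sentinel meaning "no byte available". It is negative so it
 * can never collide with a real byte value in the range 0..255. */
#define PEEK_EOS (-1)

/* The "quicker" fix as described: substitute 0 at or past the end. */
static unsigned char peek_or_zero(const unsigned char *locinput,
                                  const unsigned char *eol)
{
    assert(locinput != NULL && eol != NULL);
    return (locinput < eol) ? *locinput : 0;
}

/* Variant returning int, so end-of-string stays distinguishable from a
 * genuine NUL byte inside the string. */
static int peek_or_eos(const unsigned char *locinput,
                       const unsigned char *eol)
{
    assert(locinput != NULL && eol != NULL);
    return (locinput < eol) ? (int)*locinput : PEEK_EOS;
}
```

The first form is the smallest change, but it makes "ran off the end" look the same as "saw a NUL". The second form costs a wider type and lets later code tell the two apart.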

It looks like/I assume that the code retains the basic structure of Henry
Spencer's regexp engine, and that that was built to work on C NUL terminated
strings.

Nicholas Clark

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.11.5:

Configured by nick at Fri Mar 12 16:31:25 GMT 2010.

Summary of my perl5 (revision 5 version 11 subversion 5) configuration:
  Commit id: 801ed997c7a7937af6eb7d7e84217db79179b4f4
  Platform:
    osname=linux, osvers=2.6.18.8-xenu, archname=x86_64-linux
    uname='linux eris 2.6.18.8-xenu #1 smp sat oct 3 10:27:42 bst 2009 x86_64 gnulinux '
    config_args='-Dusedevel=y -Dcc=ccache gcc -Dld=gcc -Ubincompat5005 -Uinstallusrbinperl -Dcf_email=nick@ccl4.org -Dperladmin=nick@ccl4.org -Dinc_version_list=  -Dinc_version_list_init=0 -Doptimize=-g -Uusethreads -Uuse64bitall -Uusemymalloc -Duseperlio -Dprefix=~/Sandpit/snap5.9.x-v5.11.5-59-g801ed99 -Uusevendorprefix -Uvendorprefix=~/Sandpit/snap5.9.x-v5.11.5-59-g801ed99 -Dinstallman1dir=none -Dinstallman3dir=none -Uuserelocatableinc -de'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='ccache gcc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-g',
    cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.3.2', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
    libs=-lnsl -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.7.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -g -L/usr/local/lib -fstack-protector'

Locally applied patches:
    


@INC for perl 5.11.5:
    lib
    /home/nick/Sandpit/snap5.9.x-v5.11.5-59-g801ed99/lib/perl5/site_perl/5.11.5/x86_64-linux
    /home/nick/Sandpit/snap5.9.x-v5.11.5-59-g801ed99/lib/perl5/site_perl/5.11.5
    /home/nick/Sandpit/snap5.9.x-v5.11.5-59-g801ed99/lib/perl5/5.11.5/x86_64-linux
    /home/nick/Sandpit/snap5.9.x-v5.11.5-59-g801ed99/lib/perl5/5.11.5
    .


Environment for perl 5.11.5:
    HOME=/home/nick
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/nick/bin:/usr/local/bin:/usr/bin:/bin:/usr/games:/usr/local/sbin:/sbin:/usr/sbin
    PERL_BADLANG (unset)
    SHELL=/bin/bash


p5pRT commented Mar 14, 2010

From @khwilliamson

Nicholas Clark (via RT) wrote:

# New Ticket Created by Nicholas Clark
# Please include the string: [perl #73542]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=73542 >

This is a bug report for perl from nick@ccl4.org,
generated with the help of perlbug 1.39 running under perl 5.11.5.

It looks like/I assume that the code retains the basic structure of Henry
Spencer's regexp engine, and that that was built to work on C NUL terminated
strings.

Nicholas Clark

My guess is that it won't properly match a string that contains a NUL.

p5pRT commented Mar 14, 2010

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Mar 15, 2010

From @nwc10

On Sun, Mar 14, 2010 at 12:37:28PM -0600, karl williamson wrote:

It looks like/I assume that the code retains the basic structure of Henry
Spencer's regexp engine, and that that was built to work on C NUL terminated
strings.

My guess is that it won't properly match a string that contains a NUL.

That is my suspicion too, but I don't have any test cases.

Nicholas Clark
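
In the absence of a real test case, the behaviour one would want from a length-delimited matcher is easy to sketch in C. The hypothetical helper `find_nul_run` below looks for the first run of one or more NUL bytes, roughly what /\0+/ asks for. It treats NUL as ordinary data and never reads the byte at `buf[len]`. It is an illustration only and says nothing about how regexec.c is structured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: find the first run of one or more NUL bytes in
 * buf[0..len). Returns 1 and fills *start and *run_len on success,
 * 0 if the buffer contains no NUL. Every read is guarded by an index check
 * against len, so buf[len] is never touched. */
static int find_nul_run(const unsigned char *buf, size_t len,
                        size_t *start, size_t *run_len)
{
    assert(buf != NULL && start != NULL && run_len != NULL);

    size_t i = 0;
    while (i < len && buf[i] != 0)      /* skip leading non-NUL bytes */
        i++;
    if (i == len)
        return 0;                       /* reached the end without a NUL */

    size_t j = i;
    while (j < len && buf[j] == 0)      /* extend the run, bounded by len */
        j++;

    *start = i;
    *run_len = j - i;
    return 1;
}
```

Useful inputs for such a check would be a buffer with NULs in the middle, a buffer that is all NULs right up to the last byte, and a buffer with none at all.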

p5pRT commented Sep 26, 2012

From @cpansprout

Fixed by 7016d6e

p5pRT commented Sep 26, 2012

@cpansprout - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Sep 26, 2012

p5pRT commented Sep 26, 2012

From @nwc10

On Wed Sep 26 09:07:30 2012, sprout wrote:

Fixed by 7016d6e

Thanks Dave, for fixing this.
(And thanks, sprout, for being on top of which fixes map to which tickets.)

Nicholas Clark
