Allow variable length lookbehind for folded #16212

Closed
p5pRT opened this issue Oct 28, 2017 · 18 comments
p5pRT commented Oct 28, 2017

Migrated from rt.perl.org#132367 (status was 'resolved')

Searchable as RT132367$

p5pRT commented Oct 28, 2017

From @khwilliamson

Created by @khwilliamson

This is a bug report for perl from khw@cpan.org,
generated with the help of perlbug 1.40 running under perl 5.27.6.

-----------------------------------------------------------------
See the thread beginning with
http://nntp.perl.org/group/perl.perl5.porters/245323
Some negative lookbehind assertions were inadvertently broken by moving
to Unicode folding rules. It should be fairly easy to fix this for
these limited cases.

Perl Info

Flags:
     category=core
     severity=medium

Site configuration information for perl 5.27.6:

Configured by khw at Sat Oct 28 12:44:02 MDT 2017.

Summary of my perl5 (revision 5 version 27 subversion 6) configuration:
   Derived from: 72e3589821b9340dca575b7367c601d819d273f9
   Platform:
     osname=linux
     osvers=4.10.0-37-generic
     archname=x86_64-linux-thread-multi-ld
     uname='linux khw 4.10.0-37-generic #41-ubuntu smp fri oct 6 
20:20:37 utc 2017 x86_64 x86_64 x86_64 gnulinux '
     config_args='-des -Uversiononly -Dprefix=/home/khw/devel -Dusedevel 
-A'optimize=-ggdb3' -A'optimize=-O0' -Accflags='-Wno-c++11-compat' 
-Accflags='-DPERL_BOOL_AS_CHAR' -Accflags='-Wno-deprecated' 
-Accflags='-DPERL_EXTERNAL_GLOB' -Dman1dir=none -Dman3dir=none -Dcc=g++ 
-DDEBUGGING -Dusemorebits -Dusecbacktrace -Dusethreads'
     hint=recommended
     useposix=true
     d_sigaction=define
     useithreads=define
     usemultiplicity=define
     use64bitint=define
     use64bitall=define
     uselongdouble=define
     usemymalloc=n
     default_inc_excludes_dot=define
     bincompat5005=undef
   Compiler:
     cc='g++'
     ccflags ='-D_REENTRANT -D_GNU_SOURCE -Wno-c++11-compat 
-DPERL_BOOL_AS_CHAR -Wno-deprecated -DPERL_EXTERNAL_GLOB -fwrapv 
-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector-strong 
-I/usr/local/include -DUSE_C_BACKTRACE -g -D_LARGEFILE_SOURCE 
-D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2'
     optimize='-O2 -ggdb3 -O0'
     cppflags='-D_REENTRANT -D_GNU_SOURCE -Wno-c++11-compat 
-DPERL_BOOL_AS_CHAR -Wno-deprecated -DPERL_EXTERNAL_GLOB -fwrapv 
-DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector-strong 
-I/usr/local/include'
     ccversion=''
     gccversion='6.3.0 20170406'
     gccosandvers=''
     intsize=4
     longsize=8
     ptrsize=8
     doublesize=8
     byteorder=12345678
     doublekind=3
     d_longlong=define
     longlongsize=8
     d_longdbl=define
     longdblsize=16
     longdblkind=3
     ivtype='long'
     ivsize=8
     nvtype='long double'
     nvsize=16
     Off_t='off_t'
     lseeksize=8
     alignbytes=16
     prototype=define
   Linker and Libraries:
     ld='g++'
     ldflags =' -fstack-protector-strong -L/usr/local/lib'
     libpth=/usr/include/c++/6 /usr/include/x86_64-linux-gnu/c++/6 
/usr/include/c++/6/backward /usr/local/lib 
/usr/lib/gcc/x86_64-linux-gnu/6/include-fixed 
/usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib 
/usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
     libs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
     perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
     libc=libc-2.24.so
     so=so
     useshrplib=false
     libperl=libperl.a
     gnulibc_version='2.24'
   Dynamic Linking:
     dlsrc=dl_dlopen.xs
     dlext=so
     d_dlsymun=undef
     ccdlflags='-Wl,-E'
     cccdlflags='-fPIC'
     lddlflags='-shared -O2 -ggdb3 -O0 -L/usr/local/lib 
-fstack-protector-strong'

Locally applied patches:
     uncommitted-changes


@INC for perl 5.27.6:
     /home/khw/perl/neglookbehind/lib
     /home/khw/perl/neglookbehind/t
     /home/khw/devel/lib/perl5/site_perl/5.27.6/x86_64-linux-thread-multi-ld
     /home/khw/devel/lib/perl5/site_perl/5.27.6
     /home/khw/devel/lib/perl5/5.27.6/x86_64-linux-thread-multi-ld
     /home/khw/devel/lib/perl5/5.27.6
     /home/khw/devel/lib/perl5/site_perl/5.26.0
     /home/khw/devel/lib/perl5/site_perl


Environment for perl 5.27.6:
     HOME=/home/khw
     LANG=en_US.UTF-8
     LANGUAGE=en_US
     LD_LIBRARY_PATH (unset)
     LOGDIR (unset)
     PATH=/usr/lib/ccache:/home/khw/bin:/home/khw/perl5/perlbrew/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/usr/local/games:/home/khw/iands/www:/home/khw/cxoffice/bin
     PERL5OPT=-w
     PERL_BADLANG (unset)
     PERL_DIFF_TOOL=wgdiff
     PERL_POD_PEDANTIC=1
     SHELL=/bin/ksh

p5pRT commented Oct 30, 2018

From jsailor@techhouse.org

Created by jsailor@techhouse.org

Running

  perl -le 'print if /h(?<!ssh)base/i'

produces the error

  Variable length lookbehind not implemented in regex m/h(?<!ssh)base/ at -e line 1.

which is odd, because these work fine:

  perl -le 'print if /h(?<!xxh)base/i'
  perl -le 'print if /h(?<!xxh)baxe/i'

This report was filed from a Debian system running perl 5.20, but the problem also
reproduces on CentOS with perl 5.16, which gives the following `perl -V` output:

  Summary of my perl5 (revision 5 version 16 subversion 3) configuration​:
 
  Platform​:
  osname=linux, osvers=3.10.0-514.16.1.el7.x86_64, archname=x86_64-linux-thread-multi
  uname='linux c1bm.rdu2.centos.org 3.10.0-514.16.1.el7.x86_64 #1 smp wed apr 12 15​:04​:24 utc 2017 x86_64 x86_64 x86_64 gnulinux '
  config_args='-des -Doptimize=-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Dccdlflags=-Wl,--enable-new-dtags -Dlddlflags=-shared -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Wl,-z,relro -DDEBUGGING=-g -Dversion=5.16.3 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dprefix=/usr -Dvendorprefix=/usr -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl5 -Dsitearch=/usr/local/lib64/perl5 -Dprivlib=/usr/share/perl5 -Dvendorlib=/usr/share/perl5/vendor_perl -Darchlib=/usr/lib64/perl5 -Dvendorarch=/usr/lib64/perl5/vendor_perl -Darchname=x86_64-linux-thread-multi -Dlibpth=/usr/local/lib64 /lib64 /usr/lib64 -Duseshrplib -Dusethreads -Duseithreads -Dusedtrace=/usr/bin/dtrace -Duselargefiles -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_sethostent_r_proto -Ud_endprotoent_r_proto -Ud_setprotoent_r_proto -Ud_endservent_r_proto -Ud_setservent_r_proto -Dscriptdir=/usr/bin -Dusesitecustomize'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=define, usemultiplicity=define
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=define, use64bitall=define, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='4.8.5 20150623 (Red Hat 4.8.5-16)', gccosandvers=''
  intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
  ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries​:
  ld='gcc', ldflags =' -fstack-protector'
  libpth=/usr/local/lib64 /lib64 /usr/lib64
  libs=-lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc -lgdbm_compat
  perllibs=-lresolv -lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
  libc=, so=so, useshrplib=true, libperl=libperl.so
  gnulibc_version='2.17'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,--enable-new-dtags -Wl,-rpath,/usr/lib64/perl5/CORE'
  cccdlflags='-fPIC', lddlflags='-shared -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -Wl,-z,relro '

  Characteristics of this binary (from libperl)​:
  Compile-time options​: HAS_TIMES MULTIPLICITY PERLIO_LAYERS
  PERL_DONT_CREATE_GVSV PERL_IMPLICIT_CONTEXT
  PERL_MALLOC_WRAP PERL_PRESERVE_IVUV USE_64_BIT_ALL
  USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES
  USE_LOCALE USE_LOCALE_COLLATE USE_LOCALE_CTYPE
  USE_LOCALE_NUMERIC USE_PERLIO USE_PERL_ATOF
  USE_REENTRANT_API USE_SITECUSTOMIZE
  Locally applied patches​:
  Fedora Patch1​: Removes date check, Fedora/RHEL specific
  Fedora Patch3​: support for libdir64
  Fedora Patch4​: use libresolv instead of libbind
  Fedora Patch5​: USE_MM_LD_RUN_PATH
  Fedora Patch6​: Skip hostname tests, due to builders not being network capable
  Fedora Patch7​: Dont run one io test due to random builder failures
  Fedora Patch9​: Fix find2perl to translate ? glob properly (RT#113054)
  Fedora Patch10​: Fix broken atof (RT#109318)
  Fedora Patch13​: Clear $@​ before "do" I/O error (RT#113730)
  Fedora Patch14​: Do not truncate syscall() return value to 32 bits (RT#113980)
  Fedora Patch15​: Override the Pod​::Simple​::parse_file (CPANRT#77530)
  Fedora Patch16​: Do not leak with attribute on my variable (RT#114764)
  Fedora Patch17​: Allow operator after numeric keyword argument (RT#105924)
  Fedora Patch18​: Extend stack in File​::Glob​::glob, (RT#114984)
  Fedora Patch19​: Do not crash when vivifying $|
  Fedora Patch20​: Fix misparsing of maketext strings (CVE-2012-6329)
  Fedora Patch21​: Add NAME headings to CPAN modules (CPANRT#73396)
  Fedora Patch22​: Fix leaking tied hashes (RT#107000) [1]
  Fedora Patch23​: Fix leaking tied hashes (RT#107000) [2]
  Fedora Patch24​: Fix leaking tied hashes (RT#107000) [3]
  Fedora Patch25​: Fix dead lock in PerlIO after fork from thread (RT#106212)
  Fedora Patch26​: Make regexp safe in a signal handler (RT#114878)
  Fedora Patch27​: Update h2ph(1) documentation (RT#117647)
  Fedora Patch28​: Update pod2html(1) documentation (RT#117623)
  Fedora Patch29​: Document Math​::BigInt​::CalcEmu requires Math​::BigInt (CPAN RT#85015)
  RHEL Patch30​: Use stronger algorithm needed for FIPS in t/op/crypt.t (RT#121591)
  RHEL Patch31​: Make *DBM_File desctructors thread-safe (RT#61912)
  RHEL Patch32​: Use stronger algorithm needed for FIPS in t/op/taint.t (RT#123338)
  RHEL Patch33​: Remove CPU-speed-sensitive test in Benchmark test
  RHEL Patch34​: Make File​::Glob work with threads again
  RHEL Patch35​: Fix CRLF conversion in ASCII FTP upload (CPAN RT#41642)
  RHEL Patch36​: Do not leak the temp utf8 copy of namepv (CPAN RT#123786)
  RHEL Patch37​: Fix duplicating PerlIO​::encoding when spawning threads (RT#31923)
  Built under linux
  Compiled at Aug 2 2017 17​:45​:03
  @​INC​:
  /usr/local/lib64/perl5
  /usr/local/share/perl5
  /usr/lib64/perl5/vendor_perl
  /usr/share/perl5/vendor_perl
  /usr/lib64/perl5
  /usr/share/perl5
  .

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.20.2:

Configured by Debian Project at Sun Jun 10 18:32:44 UTC 2018.

Summary of my perl5 (revision 5 version 20 subversion 2) configuration:
   
  Platform:
    osname=linux, osvers=4.9.0-6-amd64, archname=x86_64-linux-gnu-thread-multi
    uname='linux themisto 4.9.0-6-amd64 #1 smp debian 4.9.88-1+deb9u1 (2018-05-07) x86_64 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Dldflags= -Wl,-z,relro -Dlddlflags=-shared -Wl,-z,relro -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.20 -Darchlib=/usr/lib/x86_64-linux-gnu/perl/5.20 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/x86_64-linux-gnu/perl5/5.20 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.20.2 -Dsitearch=/usr/local/lib/x86_64-linux-gnu/perl/5.20.2 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dusesitecustomize -Duse64bitint -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -Uversiononly -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.20.2 -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.9.2', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/4.9/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=libc-2.19.so, so=so, useshrplib=true, libperl=libperl.so.5.20
    gnulibc_version='2.19'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib -fstack-protector'

Locally applied patches:
    DEBPKG:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN.
    DEBPKG:debian/db_file_ver - http://bugs.debian.org/340047 Remove overly restrictive DB_File version check.
    DEBPKG:debian/doc_info - Replace generic man(1) instructions with Debian-specific information.
    DEBPKG:debian/enc2xs_inc - http://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @INC directories.
    DEBPKG:debian/errno_ver - http://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes.
    DEBPKG:debian/libperl_embed_doc - http://bugs.debian.org/186778 Note that libperl-dev package is required for embedded linking
    DEBPKG:fixes/respect_umask - Respect umask during installation
    DEBPKG:debian/writable_site_dirs - Set umask approproately for site install directories
    DEBPKG:debian/extutils_set_libperl_path - EU:MM: set location of libperl.a under /usr/lib
    DEBPKG:debian/no_packlist_perllocal - Don't install .packlist or perllocal.pod for perl or vendor
    DEBPKG:debian/prefix_changes - Fiddle with *PREFIX and variables written to the makefile
    DEBPKG:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets.
    DEBPKG:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor.
    DEBPKG:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy.
    DEBPKG:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable.
    DEBPKG:debian/mod_paths - Tweak @INC ordering for Debian
    DEBPKG:debian/module_build_man_extensions - http://bugs.debian.org/479460 Adjust Module::Build manual page extensions for the Debian Perl policy
    DEBPKG:debian/prune_libs - http://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need.
    DEBPKG:fixes/net_smtp_docs - [rt.cpan.org #36038] http://bugs.debian.org/100195 Document the Net::SMTP 'Port' option
    DEBPKG:debian/perlivp - http://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local
    DEBPKG:debian/deprecate-with-apt - http://bugs.debian.org/747628 Point users to Debian packages of deprecated core modules
    DEBPKG:debian/squelch-locale-warnings - http://bugs.debian.org/508764 Squelch locale warnings in Debian package maintainer scripts
    DEBPKG:debian/skip-upstream-git-tests - Skip tests specific to the upstream Git repository
    DEBPKG:debian/patchlevel - http://bugs.debian.org/567489 List packaged patches for 5.20.2-3+deb8u11 in patchlevel.h
    DEBPKG:debian/skip-kfreebsd-crash - http://bugs.debian.org/628493 [perl #96272] Skip a crashing test case in t/op/threads.t on GNU/kFreeBSD
    DEBPKG:fixes/document_makemaker_ccflags - http://bugs.debian.org/628522 [rt.cpan.org #68613] Document that CCFLAGS should include $Config{ccflags}
    DEBPKG:debian/find_html2text - http://bugs.debian.org/640479 Configure CPAN::Distribution with correct name of html2text
    DEBPKG:debian/perl5db-x-terminal-emulator.patch - http://bugs.debian.org/668490 Invoke x-terminal-emulator rather than xterm in perl5db.pl
    DEBPKG:debian/cpan-missing-site-dirs - http://bugs.debian.org/688842 Fix CPAN::FirstTime defaults with nonexisting site dirs if a parent is writable
    DEBPKG:fixes/memoize_storable_nstore - [rt.cpan.org #77790] http://bugs.debian.org/587650 Memoize::Storable: respect 'nstore' option not respected
    DEBPKG:debian/regen-skip - Skip a regeneration check in unrelated git repositories
    DEBPKG:fixes/regcomp-mips-optim - [perl #122817] http://bugs.debian.org/754054 Downgrade the optimization of regcomp.c on mips and mipsel due to a gcc-4.9 bug
    DEBPKG:debian/makemaker-pasthru - http://bugs.debian.org/758471 Pass LD settings through to subdirectories
    DEBPKG:fixes/perldoc-less-R - [rt.cpan.org #98636] http://bugs.debian.org/758689 Tell the 'less' pager to allow terminal escape sequences
    DEBPKG:fixes/pod_man_reproducible_date - http://bugs.debian.org/759405 Support POD_MAN_DATE in Pod::Man for the left-hand footer
    DEBPKG:fixes/io_uncompress_gunzip_inmemory - http://bugs.debian.org/747363 [rt.cpan.org #95494] Fix gunzip to in-memory file handle
    DEBPKG:fixes/socket_test_recv_fix - http://bugs.debian.org/758718 [perl #122657] Compare recv return value to peername in socket test
    DEBPKG:fixes/hurd_socket_recv_todo - http://bugs.debian.org/758718 [perl #122657] TODO checking the result of recv() on hurd
    DEBPKG:fixes/regexp-performance - [0fa70a0] http://bugs.debian.org/777556 [perl #123743] simpify and speed up /.*.../ handling
    DEBPKG:fixes/failed_require_diagnostics - http://bugs.debian.org/781120 [perl #123270] Report inaccesible file on failed require
    DEBPKG:fixes/array-cloning - http://bugs.debian.org/779357 [perl #124127] [902d169] fix cloning arrays with unused elements
    DEBPKG:fixes/perldb-threads - http://bugs.debian.org/779357 [perl #124127] [41ef2c6] lib/perl5db.pl: Restore noop lock prototype
    DEBPKG:fixes/CVE-2015-8607_file_spec_taint_fix - ensure File::Spec::canonpath() preserves taint
    DEBPKG:fixes/encode-unicode-bom - http://bugs.debian.org/798727 [rt.cpan.org #107043] Address https://rt.cpan.org/Public/Bug/Display.html?id=107043
    DEBPKG:debian/encode-unicode-bom-doc - http://bugs.debian.org/798727 Document Debian backport of Encode::Unicode fix
    DEBPKG:debian/kfreebsd-softupdates - http://bugs.debian.org/796798 Work around Debian Bug#796798
    DEBPKG:fixes/CVE-2016-2381_duplicate_env - remove duplicate environment variables from environ
    DEBPKG:debian/debugperl-compat-fix - [perl #127212] http://bugs.debian.org/810326 Disable PERL_TRACK_MEMPOOL for debugging builds
    DEBPKG:fixes/CVE-2015-8853_regexp_hang - http://bugs.debian.org/821848 [perl #123562] PATCH [perl #123562] Regexp-matching "hangs"
    DEBPKG:fixes/utf8_regexp_crash - http://bugs.debian.org/820328 [perl #124109] save_re_context(): do "local $n" with no PL_curpm
    DEBPKG:fixes/regcomp_whitespace_fix - http://bugs.debian.org/820328 [perl #124109] Perl_save_re_context(): re-indent after last commit
    DEBPKG:fixes/5.20.3/eval_label_crash - http://bugs.debian.org/822336 [perl #123652] eval {label:} crash
    DEBPKG:fixes/5.20.3/preserve_record_separator - http://bugs.debian.org/822336 [perl #123218] "preserve" $/ if set to a bad value
    DEBPKG:fixes/5.20.3/test_count_base_rs - http://bugs.debian.org/822336 Fix test count in t/base/rs.t
    DEBPKG:fixes/5.20.3/remove_get_magic - http://bugs.debian.org/822336 [perl #123739] Remove get-magic from $/
    DEBPKG:fixes/5.20.3/speed_up_scalar_g - http://bugs.debian.org/822336 [perl #123202] speed up scalar //g against tainted strings
    DEBPKG:fixes/5.20.3/accidental_all_features - http://bugs.debian.org/822336 Stop $^H |= 0x1c020000 from enabling all features
    DEBPKG:fixes/5.20.3/multidimensional_arrays_utf8 - http://bugs.debian.org/822336 [perl #124113] Make check for multi-dimensional arrays be UTF8-aware
    DEBPKG:fixes/5.20.3/unquoted_utf8_heredoc_terminators - http://bugs.debian.org/822336 Allow unquoted UTF-8 HERE-document terminators
    DEBPKG:fixes/5.20.3/parentheses_ambiguous_warning_utf8_functions - http://bugs.debian.org/822336 Fix "...without parentheses is ambuguous" warning for UTF-8 function names
    DEBPKG:fixes/5.20.3/leak_namepv_copy - http://bugs.debian.org/822336 [perl #123786] don't leak the temp utf8 copy of namepv
    DEBPKG:fixes/5.20.3/h2ph_hex_constants - http://bugs.debian.org/822336 h2ph: correct handling of hex constants for the preamble
    DEBPKG:fixes/5.20.3/leftbracket_XTERMORDORDOR - http://bugs.debian.org/822336 [perl #123711] Fix crash with 0-5x-l{0}
    DEBPKG:fixes/5.20.3/fatalize_warnings_unwinding - http://bugs.debian.org/822336 [perl #123398] don't fatalize warnings during unwinding (#123398)
    DEBPKG:fixes/5.20.3/setpgrp - http://bugs.debian.org/822336 =?UTF-8?q?Don=E2=80=99t=20treat=20setpgrp($nonzero)=20as=20setpgr?= =?UTF-8?q?p(1)?=
    DEBPKG:fixes/5.20.3/death_unwinding_crash - http://bugs.debian.org/822336 [perl #124156] RT #124156: death during unwinding causes crash
    DEBPKG:fixes/5.20.3/stashpvn_crash - http://bugs.debian.org/822336 [perl #125541] Fix crash with %::=(); J->${\"::"}
    DEBPKG:fixes/5.20.3/possessive_quantifier - http://bugs.debian.org/822336 [perl #125825] PATCH: [perl 125825] {n}+ possessive quantifier broken
    DEBPKG:fixes/5.20.3/quoted_code_crash - http://bugs.debian.org/822336 [perl #123712] Fix /$a[/ parsing
    DEBPKG:fixes/5.20.3/checking_sub_inwhat - http://bugs.debian.org/822336 [perl #123712] Don't check sub_inwhat
    DEBPKG:fixes/5.20.3/yylex_loop - http://bugs.debian.org/822336 Fix hang with "@{"
    DEBPKG:fixes/5.20.3/docs/op - http://bugs.debian.org/822336 Fix apidocs for OP_TYPE_IS(_OR_WAS) - arguments separated by |, not ,.
    DEBPKG:fixes/5.20.3/docs/encoding - http://bugs.debian.org/822336 perlpodspec: Corrections/adds to detecting =encoding
    DEBPKG:fixes/5.20.3/docs/SvPV_set - http://bugs.debian.org/822336 improve SvPV_set's docs, it really shouldn't be public API
    DEBPKG:fixes/5.20.3/docs/autodie - http://bugs.debian.org/822336 Fix warning message regarding "use autodie" and "use open".
    DEBPKG:fixes/5.20.3/docs/autodie_2_26 - http://bugs.debian.org/822336 perlunicook: Note that autodie >= 2.26 should be okay with "use open".
    DEBPKG:fixes/5.20.3/docs/setenv - http://bugs.debian.org/822336 Fix setenv() replacement documentation in perlclib
    DEBPKG:fixes/5.20.3/docs/clib_caution - http://bugs.debian.org/822336 perlhacktips: Add caution about clib ptr returns to static memory
    DEBPKG:fixes/5.20.3/docs/perlunicook_typos - http://bugs.debian.org/822336 Fix minor code typos in perlunicook
    DEBPKG:fixes/5.20.3/docs/ook_example - http://bugs.debian.org/822336 [perl #122322] Update OOK example in perlguts
    DEBPKG:fixes/5.20.3/docs/study_noop - http://bugs.debian.org/822336 perlfunc: mention that study() is currently a noop
    DEBPKG:fixes/CVE-2016-1238/remove-dot-when-loading - [perl #127834] (perl #127834) remove . from the end of @INC if complex modules are loaded
    DEBPKG:fixes/CVE-2016-1238/remove-dot-in-padwalker - [perl #127834] perl5db.pl: ensure PadWalker is loaded from standard paths
    DEBPKG:fixes/CVE-2016-1238/remove-dot-in-dist - [perl #127834] dist/: remove . from @INC when loading optional modules
    DEBPKG:fixes/CVE-2016-1238/remove-dot-in-cpan - [perl #127834] cpan/: remove . from @INC when loading optional modules
    DEBPKG:fixes/CVE-2016-1238/customized-encode - Update customized.dat for cpan/Encode/Encode.pm
    DEBPKG:debian/CVE-2016-1238/test-suite-without-dot - [perl #127810] Patch unit tests to explicitly insert "." into @INC when needed.
    DEBPKG:debian/CVE-2016-1238/eumm-without-dot - [perl #127810] Add PERL_USE_UNSAFE_INC support to EU::MM for fortify_inc support.
    DEBPKG:debian/CVE-2016-1238/cpan-without-dot - [perl #127810] Set PERL_USE_UNSAFE_INC for cpan usage
    DEBPKG:debian/CVE-2016-1238/mb-without-dot - Make Module::Build set PERL_USE_UNSAFE_INC
    DEBPKG:debian/CVE-2016-1238/sitecustomize-in-etc - Look for sitecustomize.pl in /etc/perl rather than sitelib on Debian systems
    DEBPKG:fixes/xsloader-eval - [rt.cpan.org #115808] http://bugs.debian.org/829578 =?UTF-8?q?Don=E2=80=99t=20let=20XSLoader=20load=20relative=20path?= =?UTF-8?q?s?=
    DEBPKG:fixes/file_path_chmod_race - http://bugs.debian.org/863870 [rt.cpan.org #121951] Prevent directory chmod race attack.
    DEBPKG:fixes/extutils_file_path_compat - [PATCH] Correct the order of tests of chmod(). (#294)
    DEBPKG:debian/customized_file_path - Update customized.dat for File-Path changes
    DEBPKG:debian/CVE-2016-1238/base-pm-amends-pt1 - Revert base.pm no-dot-in-inc fixes to make way for a better version
    DEBPKG:debian/CVE-2016-1238/base-pm-amends-pt2 - [1afa289] Limit dotless-INC effect on base.pm with guard:
    DEBPKG:fixes/CVE-2017-12837 - http://bugs.debian.org/875596 [perl #131582] regcomp [perl #131582]
    DEBPKG:fixes/CVE-2017-12883 - http://bugs.debian.org/875597 [perl #131598] PATCH: [perl #131598]
    DEBPKG:fixes/CVE-2017-12883-5.20 - http://bugs.debian.org/875597 [perl #131598] regcomp: Fix out of bound reads
    DEBPKG:fixes/CVE-2018-6913 - [perl #131844] (perl #131844) fix various space calculation issues in pp_pack.c
    DEBPKG:fixes/CVE-2018-12015-Archive-Tar-directory-traversal - http://bugs.debian.org/900834 [rt.cpan.org #125523] Remove existing files before overwriting them


@INC for perl 5.20.2:
    /etc/perl
    /usr/local/lib/x86_64-linux-gnu/perl/5.20.2
    /usr/local/share/perl/5.20.2
    /usr/lib/x86_64-linux-gnu/perl5/5.20
    /usr/share/perl5
    /usr/lib/x86_64-linux-gnu/perl/5.20
    /usr/share/perl/5.20
    /usr/local/lib/site_perl


Environment for perl 5.20.2:
    HOME=/home/jsailor
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LC_COLLATE=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/th/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

p5pRT commented Oct 30, 2018

From @tonycoz

On Tue, 30 Oct 2018 14:12:05 -0700, jsailor@techhouse.org wrote:

> Running
>
>   perl -le 'print if /h(?<!ssh)base/i'
>
> produces the error
>
>   Variable length lookbehind not implemented in regex m/h(?<!ssh)base/
>   at -e line 1.
>
> which is odd, because these work fine
>
>   perl -le 'print if /h(?<!xxh)base/i'
>   perl -le 'print if /h(?<!xxh)baxe/i'
>
> works fine.

I believe it's the "ss", which can match ß (\xdf, LATIN SMALL LETTER SHARP S), which makes it variable length.

Some ligatures can cause the same problem:

$ perl -E 'qr/h(?<!ff)base/i'
Variable length lookbehind not implemented in regex m/h(?<!ff)base/ at -e line 1.
$ perl -E 'qr/h(?<!fi)base/i'
Variable length lookbehind not implemented in regex m/h(?<!fi)base/ at -e line 1.

If you only need to match ASCII case-insensitively you can use the /aa flag:

$ perl -E 'qr/h(?<!ssh)base/aai'

If your perl is too old for the /aa flag, I think you're stuck with using character classes and no /i flag.

Tony
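
The length change Tony describes can be checked directly. The following is a minimal sketch, not taken from the thread: it assumes perl 5.16 or later (for `fc`), and the variable names are chosen only for illustration. It folds two characters that have multi-character case folds, then compares how the two-character pattern "ss" matches the single character sharp s under /ui and under /aai.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);   # fc() needs perl 5.16 or later

# Full case folding can change a string's length: one character in,
# two characters out.
my $sharp_s = "\x{DF}";      # LATIN SMALL LETTER SHARP S
my $ff_lig  = "\x{FB00}";    # LATIN SMALL LIGATURE FF

my $fold_len_sharp_s = length fc($sharp_s);
my $fold_len_ff      = length fc($ff_lig);

# Under Unicode rules (/u) the two-character pattern "ss" is allowed to
# match the single character sharp s, so its width is not fixed.
my $matches_unicode = ($sharp_s =~ /\Ass\z/ui) ? 1 : 0;

# Under /aa, ASCII and non-ASCII characters never match each other,
# so "ss" can only match exactly two characters.
my $matches_ascii = ($sharp_s =~ /\Ass\z/aai) ? 1 : 0;

printf "fold lengths: %d %d; /ui match: %d; /aai match: %d\n",
    $fold_len_sharp_s, $fold_len_ff, $matches_unicode, $matches_ascii;
```

Because the /aai pattern can only ever consume two characters, a lookbehind containing it stays fixed-width, which is why the /aa workaround avoids the error.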

p5pRT commented Oct 30, 2018

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Oct 31, 2018

From @khwilliamson

#133630 is a duplicate of #132367. I have merged them together.

I would like to fix this in 5.30. But I need some guidance and/or collaboration from someone like Yves or Hugo to get started.

--
Karl Williamson

p5pRT commented Oct 31, 2018

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Nov 1, 2018

From @hvds

On Wed, 31 Oct 2018 10:07:14 -0700, khw wrote:

> #133630 is a duplicate of #132367. I have merged them together
>
> I would like to fix this in 5.30. But I need some guidance and/or
> collaboration from someone like Yves or Hugo to get started.

I'd be happy to collaborate, but I'm not confident I currently know any viable approach (and in particular I don't know what algorithm Yves had in mind in
https://www.nntp.perl.org/group/perl.perl5.porters/2017/07/msg245348.html). Do you already have some avenue of attack in mind?

The rough classes of direction I can think of are:
A) new flag to disable mixed-length casefolds
B) support variable-width lookbehind
  B-1) by trying all start points (conceptually, culled by optimizer)
  B-2) by trying a calculated range of start points (minlen..maxlen)
  B-3) by trying one start point, calculated at the point of match
  B-4) by reversal
... and for all (B) cases, we could aim to support either (a) all lookbehind patterns, or (b) only compile-time-detectable "simple enough" ones.

FWIW, I had another use-case a couple of days ago where an elegant solution to a problem would have required a different special case, a lookbehind that was variable-width but anchored to start. Things like that, and cases where you want literal match of something unknown at compile time (eg a capture), would probably be simple extensions of what we have now, and easily fit in the B-3b approach.

I'd be tempted to try such an extension first as an exploratory measure, to get a better feel for what might be possible when the pattern is truly variable.

Hugo
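
Option B-2 can be pictured outside the regex engine. The sketch below is only an illustration of the idea under stated assumptions, not how perl implements or would implement lookbehind: `behind_matches` is a hypothetical helper, and the minimum and maximum widths are passed in by hand rather than computed from the pattern. For each candidate width in minlen..maxlen it tests whether the text ending at the current position matches the lookbehind pattern.

```perl
use strict;
use warnings;

# Hypothetical helper (not a perl API): emulate a lookbehind whose width
# may vary between $min and $max characters by testing every candidate
# start point for text that ends at $pos.
sub behind_matches {
    my ($str, $pos, $re, $min, $max) = @_;
    for my $len ($min .. $max) {
        next if $len > $pos;                  # would start before the string
        my $candidate = substr($str, $pos - $len, $len);
        return 1 if $candidate =~ /\A(?:$re)\z/;
    }
    return 0;
}

# "ss" folded under Unicode rules: the text before $pos may be "ss"
# (two characters) or sharp s (one character), so the range is 1..2.
my $re = qr/ss/ui;
my $behind_ss      = behind_matches("sshbase",     2, $re, 1, 2);
my $behind_sharp_s = behind_matches("\x{DF}hbase", 1, $re, 1, 2);
my $behind_xx      = behind_matches("xxhbase",     2, $re, 1, 2);

print "ss: $behind_ss, sharp s: $behind_sharp_s, xx: $behind_xx\n";
```

A negative lookbehind would succeed exactly where this helper returns 0. The cost grows with the size of the minlen..maxlen range, which is why the approach suits the limited length changes that folding introduces.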

p5pRT commented Nov 4, 2018

From jsailor@techhouse.org

On Tue, 30 Oct 2018, Tony Cook via RT wrote:

>> perl -le 'print if /h(?<!ssh)base/i'
>>
>> produces the error
>>
>> Variable length lookbehind not implemented in regex m/h(?<!ssh)base/
>> at -e line 1.
>
> I believe it's the "ss", which can match ß (\xdf, LATIN SMALL LETTER
> SHARP S), which makes it variable length.

Yup, that's it all right. It's even listed as an example in perldiag,
right under my nose :/

Though, I'm a little surprised that

  LANG=C LC_ALL=C perl -e 'qr/(?<!ss)/di'

hits this. My expectation would be that it works, since /d means "to use the
'Default' native rules of the platform except when there is cause to use
Unicode rules instead, as follows: [various conditions that wouldn't
apply unless/until the rx is matched against a utf8 string]", even when
combined with /i. I guess the engine has to compile the pattern to match
both "ss" and \N{LATIN SMALL LETTER SHARP S}, and then for non-utf8
strings the second branch just never gets hit or something.

Feel free to close as RTFM I guess.

~jon.
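
To illustrate why "ss" turns the lookbehind into a variable-length one, here is a minimal sketch in Python rather than perl. It assumes Python's `str.casefold()`, which applies Unicode full case folding, and only shows the length mismatch; it says nothing about how perl's regex compiler detects it.

```python
# Under Unicode full case folding, "ß" folds to the two characters "ss",
# so a caseless "ss" can consume either 2 characters or 1.
candidates = ["ss", "SS", "ß"]  # strings whose fold is "ss"

for s in candidates:
    print(repr(s), len(s), "->", repr(s.casefold()))

# Distinct lengths a caseless match of "ss" may consume.
lengths = sorted({len(s) for s in candidates if s.casefold() == "ss"})
print(lengths)  # [1, 2]
```

Because the possible lengths differ, a lookbehind containing "ss" under /i has no single fixed width to step back by.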

p5pRT commented Nov 12, 2018

From @khwilliamson

Yves is apparently too busy to reply.

I looked at this yesterday, and came up with a scheme that I believe
will work for limited cases of folding length changes, essentially B-2.

The code that raises the variable-length lookbehind error is in the
optimizer. It raises the error when it detects that a delta is
non-zero. *If* this delta is accurate, the work for finding the
calculated range of start points has already been done for us.

The two regnode types that deal with lookbehind are IFMATCH and UNLESSM.
The next_off field in both of them is apparently unused. In most
regnodes, this field contains the offset to the next regnode that gets
executed after the current one, and is a 16-bit value. But certain
regnodes have an extra, 32-bit, field for this purpose. These two are
among them.

This field could be used to store the delta.

At execution, regexec.c could use this delta to calculate minlen..maxlen
and try the match at each position in the range.
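
The B-2 idea above can be sketched roughly as follows. This is an illustration in Python, not regexec.c: the function name `lookbehind_ok`, its parameters, and the hand-written pattern `SS` are all hypothetical, and it assumes the range is minlen..maxlen with maxlen = minlen + delta.

```python
import re

def lookbehind_ok(body, text, pos, minlen, maxlen, negate=False):
    """Emulate a lookbehind at `pos` by trying every candidate start
    point whose distance from `pos` lies in minlen..maxlen (B-2).
    `negate=True` plays the role of a negative lookbehind."""
    compiled = re.compile(body, re.IGNORECASE)
    matched = False
    for length in range(minlen, maxlen + 1):
        start = pos - length
        if start < 0:
            break  # longer candidates would start before the string
        # The body must match exactly the span text[start:pos].
        if compiled.fullmatch(text, start, pos):
            matched = True
            break
    return matched != negate

# Hand-written stand-in for a caseless "ss": two characters or one.
SS = "(?:ss|ß)"
print(lookbehind_ok(SS, "Strasse", 6, 1, 2))  # True, 2-char start point
print(lookbehind_ok(SS, "Straße", 5, 1, 2))   # True, 1-char start point
print(lookbehind_ok(SS, "Strase", 5, 1, 2))   # False, no start point fits
```

The cost is at most maxlen - minlen + 1 match attempts per lookbehind, which stays small when the only source of variation is a fold length change.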

I don't think B-4 is viable as I understand it. It doesn't work to
start at the end and work backwards: in a multi-char fold, you'd see
the final character first, and would have to keep going in order to find
the beginning of the fold. A work-around would be to generate a list of
the relatively few characters that are like this. This would be simple
to do. Then, when one of them is encountered, add special handling for it.
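
As a rough idea of what generating such a list could look like, here is a sketch in Python. It assumes `str.casefold()` as the source of full case folds (perl would use its own Unicode tables), and the names `multi_char_folds` and `final_chars` are hypothetical.

```python
import sys

def multi_char_folds():
    """Return {char: folded} for every code point whose full case fold
    is longer than one character."""
    result = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        folded = ch.casefold()
        if len(folded) > 1:
            result[ch] = folded
    return result

folds = multi_char_folds()

# Characters that end some multi-char fold: meeting one of these while
# scanning backwards means the fold might begin further to the left.
final_chars = {folded[-1] for folded in folds.values()}

print(len(folds), "code points have multi-char folds")
print(repr(folds["ß"]))  # 'ss'
```

The exact count depends on the Unicode version of the Python build, but it is small compared with the whole code point range, which is what makes special handling plausible.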

I don't like option A. I'd rather not increase the cognitive load on
our users. The modifier /aa already is available for the most common cases.

B-2 seems better than B-1 to me, and apparently easy to implement.

B-3 doesn't work, as the start point does vary, unless I misunderstand
what you meant.

p5pRT commented Nov 13, 2018

From @demerphq

On Mon, 12 Nov 2018 at 18​:17, Karl Williamson <public@​khwilliamson.com> wrote​:

Yves is apparently too busy to reply.

Guilt is a powerful motivator. :-)

I looked at this yesterday, and came up with a scheme that I believe
will work for limited cases of folding length changes, essentially B-2.

This problem is definitely more tractable than supporting arbitrary
expressions. I would also have taken the B-2 route, in fact, I think
practically speaking that is the only option.

The code that raises the variable-length lookbehind error is in the
optimizer. It raises the error when it detects that a delta is
non-zero. *If* this delta is accurate, the work for finding the
calculated range of start points has already been done for us.

The two regnode types that deal with lookbehind are IFMATCH and UNLESSM.
The next_off field in both of them is apparently unused. In most
regnodes, this field contains the offset to the next regnode that gets
executed after the current one, and is a 16 bit value. But certain
regnodes have an extra, 32-bit, field for this purpose. These two are
among them.

This field could be used to store the delta.

At execution, regexec.c could use this delta to calculate minlen..maxlen
and try the match at each position in the range.

Yes, this is basically what I would do.

I don't think B-4 is viable as I understand it. It doesn't work to
start at the end and work backwards, as in a multi-char fold, you'd see
the final character first, and would have to keep going in order to find
the beginning of the fold. A work-around would be to generate a list of
the relatively few characters that are like this. This would be simple
to do. Then, when one of them is encountered, add special handling for it.

I think B-4 *is* viable, it's just that it is something like a PhD
thesis' worth of work, so it is not *practically* viable.

I don't like option A. I'd rather not increase the cognitive load on
our users. The modifier /aa already is available for the most common cases.

Agreed.

B-2 seems better than B-1 to me, and apparently easy to implement.

Agreed.

B-3 doesn't work, as the start point does vary, unless I misunderstand
what you meant.

Agreed.

And B-0 is a performance nightmare waiting to happen.

I fully support B-2. I suspect there are some interesting questions
about how it will interact with capture buffers, however. I do not envy
you taking this on; I tried in the past and it hurt my brain.

Good luck!

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented Dec 26, 2018

From @khwilliamson

On 11/13/18 4​:10 AM, demerphq wrote​:

I fully support B-2. I suspect there are some interesting question
about how it will interact with capture buffers however. I do not envy
you taking this on, I tried in the past and it hurt my brain.

Having looked at the code some, my changes don't affect any area with
capture buffers. Why do you think this would be different from other
lookaround assertions and their interactions with capture buffers?

p5pRT commented Dec 27, 2018

From @demerphq

It's just a hazy memory that capture buffers inside of look around
operators can be problematic. If you don't see any test fails I guess you
are fine.

Merry Christmas!
Cheers
Yves

p5pRT commented Mar 18, 2019

From wagnerc@plebeian.com

On Tue, 30 Oct 2018 15​:53​:01 -0700, tonyc wrote​:

On Tue, 30 Oct 2018 14​:12​:05 -0700, jsailor@​techhouse.org wrote​:

Running

perl -le 'print if /h(?<!ssh)base/i'

produces the error

Variable length lookbehind not implemented in regex m/h(?<!ssh)base/
at -e line 1.

which is odd, because these work fine:

perl -le 'print if /h(?<!xxh)base/i'
perl -le 'print if /h(?<!xxh)baxe/i'

I believe it's the "ss", which can match ß (\xdf, LATIN SMALL LETTER
SHARP S), which makes it variable length.

Some ligatures can cause the same problem​:

$ perl -E 'qr/h(?<!ff)base/i'
Variable length lookbehind not implemented in regex m/h(?<!ff)base/ at
-e line 1.
$ perl -E 'qr/h(?<!fi)base/i'
Variable length lookbehind not implemented in regex m/h(?<!fi)base/ at
-e line 1.

If you only need to match ASCII case-insensitively you can use the /aa
flag​:

$ perl -E 'qr/h(?<!ssh)base/aai'

If your perl is too old for the /aa flag, I think you're stuck with
using character classes and no /i flag.

Tony

I think that a simple fix to this problem would be to require that the single Unicode character be written explicitly in the pattern to trigger variable-length case folding. That way a lower-case ASCII sequence in the pattern is always interpreted as only those ASCII characters.

e.g.
qr/(?<=boss)man/i can only be qr/(?<=[Bb][Oo][Ss][Ss])[Mm][Aa][Nn]/i

To get variable-length case folding you must write e.g.
qr/(?<=boß)man/i will be qr/(?<=[Bb][Oo](?-i:ß|s[s]))[Mm][Aa][Nn]/i

Putting use re "/aa"; is another option.
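
The character-class expansion described above can be sketched like this. It is written in Python purely as an illustration; the helper name `ascii_caseless` is hypothetical, and Python's `re` stands in for perl only to show that the expanded lookbehind has a fixed width.

```python
import re

def ascii_caseless(literal):
    """Expand each ASCII letter to an explicit [Xx] class and escape
    everything else, so no /i-style folding is needed."""
    parts = []
    for ch in literal:
        if ch.isascii() and ch.isalpha():
            parts.append(f"[{ch.upper()}{ch.lower()}]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

lookbehind = ascii_caseless("ssh")
print(lookbehind)  # [Ss][Ss][Hh]

# Rough equivalent of /h(?<!ssh)base/ with ASCII-only case-insensitivity.
pattern = re.compile(
    f"{ascii_caseless('h')}(?<!{lookbehind}){ascii_caseless('base')}"
)

print(bool(pattern.search("xxhbase")))  # True
print(bool(pattern.search("SSHBASE")))  # False, lookbehind sees "SSH"
print(bool(pattern.search("ßhbase")))   # True, "ß" is not treated as "ss"
```

Every class consumes exactly one character, so the lookbehind is three characters wide no matter what the input contains. The trade-off is that "ß" no longer counts as "ss", which matches what the /aa flag gives.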

Thanks

p5pRT commented Mar 18, 2019

From @khwilliamson

Top posted; too late, blead now has variable length lookbehind

--
Karl Williamson

p5pRT commented Mar 18, 2019

From @khwilliamson

Fixed by commit
2fe8bdb
--
Karl Williamson

p5pRT commented Mar 18, 2019

@khwilliamson - Status changed from 'open' to 'pending release'

p5pRT commented May 22, 2019

From @khwilliamson

Thank you for filing this report. You have helped make Perl better.

With the release today of Perl 5.30.0, this and 160 other issues have been
resolved.

Perl 5.30.0 may be downloaded via​:
https://metacpan.org/release/XSAWYERX/perl-5.30.0

If you find that the problem persists, feel free to reopen this ticket.

p5pRT commented May 22, 2019

@khwilliamson - Status changed from 'pending release' to 'resolved'
