Skip Menu |
Report information
Id: 121440
Status: resolved
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: craigberry <craigberry [at] mac.com>
Cc:
AdminCc:

Operating System: vms
PatchStatus: (no value)
Severity: low
Type: unknown
Perl Version: (no value)
Fixed In: (no value)



Subject: spurious lstat calls in Perl_do_readline / glob
Date: Fri, 14 Mar 2014 15:51:24 -0500
To: perlbug [...] perl.org
From: "Craig A. Berry" <craigberry [...] mac.com>
In pp_hot.c in Perl_do_readline, each and every line returned by a glob operation is run through the following check: for (t1 = SvPVX_const(sv); *t1; t1++) if (strchr("$&*(){}[]'\";\\|?<>~`", *t1)) break; if (*t1 && PerlLIO_lstat(SvPVX_const(sv), &PL_statbuf) < 0) { (void)POPs; /* Unmatched wildcard? Chuck it... */ continue; } where sv contains the line returned by calling sv_gets on the file pointer returned by Perl_start_glob. The code is ancient; the basic logic hasn't changed since at least 5.000 twenty years ago. Checking each and every character of each and every filename for its presence in a hard-coded list of pattern match characters is a fair amount of processing, but is also non-portable and, more seriously, causes many extra calls to the relatively slow lstat() in very common use cases on non-Unix platforms. On Windows, any path with a backslash directory delimiter will trigger an unnecessary lstat(). On VMS, any native path with bracket directory delimiters ([]<>) or a semicolon version delimiter (;) will trigger an unnecessary lstat(). Also, on VMS, the dollar sign is a perfectly legal and extremely common filename character, but its presence will trigger an lstat(). I believe the intent of the code quoted above is to detect the case where the shell's glob returns the input pattern verbatim, indicating "no match." The lstat() appears to be a sanity check, as if to say, "not only does this file have pattern characters in its name, it also does not in fact exist." If that is the case, it seems to me a better check would be to compare the first line returned from the glob (and the first line only) with the input pattern for an exact match and skip only that case. That wouldn't work if there are shells that normalize the input pattern in some way before returning it, but I don't know if that's ever the case. As far as I know (which may not be very for, so please correct me) there would never be a reason to do this check on any but the first line of input (I *think* C<IoLINES(io) == 1> would do this but haven't checked). By the time you have two lines returned from a glob operation, you know you have at least two matches, so a check for the "no match" case no longer makes sense. I don't know much about unixy glob implementations, so it's possible the foregoing analysis isn't correct, but I have stepped through and watched the extra lstat() calls, so there is a definite performance bug here as that's I/O to the file header that is completely unnecessary. The following simple example shows that the mere presence of a version specification (;*) on the glob pattern causes a stat call on each filename returned. First with no debugging output: $ perl -we "print qq/$x\n/ while ($x = <*.com;*>);" configure.com;1 extra_pods.com;1 perl_setup.com;1 vmspipe.com;1 and now with Perl debugger tracing enabled but also run under the VMS debugger to show where the internal stat routine gets called: $ dbgperl -Dt -we "print qq/$x\n/ while ($x = <*.com;*>);" %DEBUG-W-DWNOT1PROC, the 1 process debugger cannot be run in DECwindows mode OpenVMS I64 Debug64 Version V8.4-001 %DEBUG-I-INITIAL, Language: C, Module: PERLMAIN %DEBUG-I-NOTATMAIN, Type GO to reach MAIN program Show quoted text
DBG> go
break at routine PERLMAIN\main in THREAD 1 88751: PL_use_safe_putenv = FALSE; Show quoted text
DBG> set image dbgperlshr DBG> set module vms DBG> set break Perl_flex_stat_int do (examine *fspec; go) DBG> g
EXECUTING... (-e:0) enter (-e:0) nextstate (-e:1) enter (-e:1) const(PV("*.com;*"\0)) (-e:1) gv(__ANON__::) (-e:1) glob break at routine VMS\Perl_flex_stat_int in THREAD 1 141001: char *temp_fspec = NULL; *VMS\Perl_flex_stat_int\fspec: "*.com;*" break at routine VMS\Perl_flex_stat_int in THREAD 1 141001: char *temp_fspec = NULL; *VMS\Perl_flex_stat_int\fspec: "configure.com;1" (-e:1) gvsv(main::x) (-e:1) sassign (-e:1) defined (-e:1) and (-e:1) pushmark (-e:1) gvsv(main::x) (-e:1) const(PV("\n"\0)) (-e:1) concat (-e:1) print configure.com;1 (-e:1) unstack (-e:1) const(PV("*.com;*"\0)) (-e:1) gv(__ANON__::) (-e:1) glob break at routine VMS\Perl_flex_stat_int in THREAD 1 141001: char *temp_fspec = NULL; *VMS\Perl_flex_stat_int\fspec: "extra_pods.com;1" (-e:1) gvsv(main::x) (-e:1) sassign (-e:1) defined (-e:1) and (-e:1) pushmark (-e:1) gvsv(main::x) (-e:1) const(PV("\n"\0)) (-e:1) concat (-e:1) print extra_pods.com;1 (-e:1) unstack (-e:1) const(PV("*.com;*"\0)) (-e:1) gv(__ANON__::) (-e:1) glob break at routine VMS\Perl_flex_stat_int in THREAD 1 141001: char *temp_fspec = NULL; *VMS\Perl_flex_stat_int\fspec: "perl_setup.com;1" (-e:1) gvsv(main::x) (-e:1) sassign (-e:1) defined (-e:1) and (-e:1) pushmark (-e:1) gvsv(main::x) (-e:1) const(PV("\n"\0)) (-e:1) concat (-e:1) print perl_setup.com;1 (-e:1) unstack (-e:1) const(PV("*.com;*"\0)) (-e:1) gv(__ANON__::) (-e:1) glob break at routine VMS\Perl_flex_stat_int in THREAD 1 141001: char *temp_fspec = NULL; *VMS\Perl_flex_stat_int\fspec: "vmspipe.com;1" (-e:1) gvsv(main::x) (-e:1) sassign (-e:1) defined (-e:1) and (-e:1) pushmark (-e:1) gvsv(main::x) (-e:1) const(PV("\n"\0)) (-e:1) concat (-e:1) print vmspipe.com;1 (-e:1) unstack (-e:1) const(PV("*.com;*"\0)) (-e:1) gv(__ANON__::) (-e:1) glob (-e:1) gvsv(main::x) (-e:1) sassign (-e:1) defined (-e:1) and (-e:1) leave (-e:1) leave %DEBUG-I-EXITSTATUS, is '%SYSTEM-S-NORMAL, normal successful completion' Show quoted text
DBG> Exit
Thanks go to Hein van den Heuvel for making me aware of this problem. $ perl -"V" Summary of my perl5 (revision 5 version 19 subversion 10) configuration: Snapshot of: f53a6e0ea91f342c0645d72c1c6cfbd044b1794b Platform: osname=VMS, osvers=V8.4, archname=VMS_IA64-thread-multi uname='VMS alma V8.4 HP rx2600 (1.50GHz/6.0MB)' config_args='-"Dusedevel" -"DDEBUGGING" -"Dusethreads" -"Dusevmsdebug" -"des"' hint=none, useposix=false, d_sigaction=define useithreads=define, usemultiplicity=define use64bitint=undef, use64bitall=undef, uselongdouble=undef usemymalloc=undef, bincompat5005=undef Compiler: cc='CC/DECC', ccflags ='/Include=[]/Standard=Relaxed_ANSI/Prefix=All/Obj=.obj /NOANSI_ALIAS/float=ieee/ieee=denorm/NAMES=(SHORTENED)/Define=_USE _STD_STAT=1', optimize='/List/Debug/NoOpt', cppflags='undef' ccversion='70390020', gccversion='', gccosandvers='undef' intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16 ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=8, prototype=define Linker and Libraries: ld='Link/nodebug', ldflags ='/Debug/Trace/Map' libpth=/sys$share /sys$library libs= perllibs= libc=(DECCRTL), so=exe, useshrplib=true, libperl=undef gnulibc_version='undef' Dynamic Linking: dlsrc=dl_vms.xs, dlext=exe, d_dlsymun=undef, ccdlflags='' cccdlflags='', lddlflags='/Share' Characteristics of this PERLSHR image: Compile-time options: DEBUGGING HAS_TIMES HAVE_INTERP_INTERN MULTIPLICITY PERLIO_LAYERS PERL_DONT_CREATE_GVSV PERL_EXTERNAL_GLOB PERL_HASH_FUNC_ONE_AT_A_TIME_HARD PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP PERL_NEW_COPY_ON_WRITE PERL_PRESERVE_IVUV PERL_TRACK_MEMPOOL USE_IEEE USE_ITHREADS USE_LARGE_FILES USE_LOCALE USE_LOCALE_COLLATE USE_LOCALE_CTYPE USE_LOCALE_NUMERIC USE_PERLIO USE_PERL_ATOF USE_REENTRANT_API VMS_DO_SOCKETS VMS_SHORTEN_LONG_SYMBOLS Built under VMS Compiled at Mar 12 2014 09:05:41 %ENV: PERLSHR="perl_root:[000000]perlshr.exe" PERL_ROOT="DSA0:[craig.blead.]" @INC: /perl_root/lib/site_perl/VMS_IA64-thread-multi /perl_root/lib/site_perl /perl_root/lib/VMS_IA64-thread-multi/5_19_10 /perl_root/lib . Show quoted text
________________________________________ Craig A. Berry mailto:craigberry@mac.com "... getting out of a sonnet is much more difficult than getting in." Brad Leithauser
From: "Craig A. Berry" <craigberry [...] mac.com>
To: perl5-porters [...] perl.org
Date: Mon, 17 Mar 2014 07:33:24 -0500
Subject: Re: [perl #121440] spurious lstat calls in Perl_do_readline / glob
CC: perlbug-followup [...] perl.org
Download (untitled) / with headers
text/plain 3.3k
On 3/14/14, 3:51 PM, Craig A . Berry wrote: Show quoted text
> # New Ticket Created by Craig A. Berry > # Please include the string: [perl #121440] > # in the subject line of all future correspondence about this issue. > # <URL: https://rt.perl.org/Ticket/Display.html?id=121440 > > > > In pp_hot.c in Perl_do_readline, each and every line returned by a > glob operation is run through the following check: > > for (t1 = SvPVX_const(sv); *t1; t1++) > if (strchr("$&*(){}[]'\";\\|?<>~`", *t1)) > break; > if (*t1 && PerlLIO_lstat(SvPVX_const(sv), &PL_statbuf) < 0) { > (void)POPs; /* Unmatched wildcard? Chuck it... */ > continue; > } > > where sv contains the line returned by calling sv_gets on the file > pointer returned by Perl_start_glob. The code is ancient; the basic > logic hasn't changed since at least 5.000 twenty years ago. > > Checking each and every character of each and every filename for its > presence in a hard-coded list of pattern match characters is a fair > amount of processing, but is also non-portable and, more seriously, > causes many extra calls to the relatively slow lstat() in very common > use cases on non-Unix platforms.
What I forgot about when I posted is that this code only kicks in when PERL_EXTERNAL_GLOB is defined, which on most platforms means only for miniperl but on VMS means always. Show quoted text
> As far as I know (which may not be very for, so please correct me) > there would never be a reason to do this check on any but the first > line of input (I *think* C<IoLINES(io) == 1> would do this but > haven't checked). By the time you have two lines returned from a > glob operation, you know you have at least two matches, so a check > for the "no match" case no longer makes sense.
Answering my own question, I believe the expectation is that multiple glob patterns can be processed in sequence, which means an unexpanded wildcard could be found anywhere in the input, such is the *.flirble in the following: $ echo *.lst *.flirble *.SH mkppport.lst utils.lst *.flirble Makefile.SH Policy_sh.SH cflags.SH config_h.SH makedepend.SH metaconfig.SH myconfig.SH runtests.SH So this can't be fixed in the universal fashion I was hoping. The patch below fixes it on VMS by only checking for wildcard characters that are meaningful in the VMS glob implementation, thus saving the lstat() call on all the perfectly normal filenames having brackets, semicolons, and dollar signs. This makes a glob of the Perl source tree 60% faster on first iteration and 80% faster on subsequent iterations (presumably because caching of directories doesn't get swamped out by all the extra lstat() calls). It doesn't do anything for paths containing backslashes on Windows, which may be of interest to some folks even though it only affects miniperl. So I'll push this soonish unless someone has a better suggestion. --- pp_hot.c.orig 2014-03-14 21:40:45 -0500 +++ pp_hot.c 2014-03-15 18:00:37 -0500 @@ -1698,7 +1698,11 @@ Perl_do_readline(pTHX) } } for (t1 = SvPVX_const(sv); *t1; t1++) +#ifdef __VMS + if (strchr("*%?", *t1)) +#else if (strchr("$&*(){}[]'\";\\|?<>~`", *t1)) +#endif break; if (*t1 && PerlLIO_lstat(SvPVX_const(sv), &PL_statbuf) < 0) { (void)POPs; /* Unmatched wildcard? Chuck it... */
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 3.8k
On Mon, 17 Mar 2014 12:33:58 GMT, craigberry wrote: Show quoted text
> On 3/14/14, 3:51 PM, Craig A . Berry wrote:
> > # New Ticket Created by Craig A. Berry > > # Please include the string: [perl #121440] > > # in the subject line of all future correspondence about this issue. > > # <URL: https://rt.perl.org/Ticket/Display.html?id=121440 > > > > > > > In pp_hot.c in Perl_do_readline, each and every line returned by a > > glob operation is run through the following check: > > > > for (t1 = SvPVX_const(sv); *t1; t1++) > > if (strchr("$&*(){}[]'\";\\|?<>~`", *t1)) > > break; > > if (*t1 && PerlLIO_lstat(SvPVX_const(sv), &PL_statbuf) < 0) { > > (void)POPs; /* Unmatched wildcard? Chuck it... */ > > continue; > > } > > > > where sv contains the line returned by calling sv_gets on the file > > pointer returned by Perl_start_glob. The code is ancient; the basic > > logic hasn't changed since at least 5.000 twenty years ago. > > > > Checking each and every character of each and every filename for its > > presence in a hard-coded list of pattern match characters is a fair > > amount of processing, but is also non-portable and, more seriously, > > causes many extra calls to the relatively slow lstat() in very common > > use cases on non-Unix platforms.
> > What I forgot about when I posted is that this code only kicks in when > PERL_EXTERNAL_GLOB is defined, which on most platforms means only for > miniperl but on VMS means always. >
> > As far as I know (which may not be very for, so please correct me) > > there would never be a reason to do this check on any but the first > > line of input (I *think* C<IoLINES(io) == 1> would do this but > > haven't checked). By the time you have two lines returned from a > > glob operation, you know you have at least two matches, so a check > > for the "no match" case no longer makes sense.
> > Answering my own question, I believe the expectation is that multiple > glob patterns can be processed in sequence, which means an unexpanded > wildcard could be found anywhere in the input, such is the *.flirble > in > the following: > > $ echo *.lst *.flirble *.SH > mkppport.lst utils.lst *.flirble Makefile.SH Policy_sh.SH cflags.SH > config_h.SH makedepend.SH metaconfig.SH myconfig.SH runtests.SH > > So this can't be fixed in the universal fashion I was hoping. The > patch > below fixes it on VMS by only checking for wildcard characters that > are > meaningful in the VMS glob implementation, thus saving the lstat() > call > on all the perfectly normal filenames having brackets, semicolons, and > dollar signs. > > This makes a glob of the Perl source tree 60% faster on first > iteration > and 80% faster on subsequent iterations (presumably because caching of > directories doesn't get swamped out by all the extra lstat() calls). > It > doesn't do anything for paths containing backslashes on Windows, which > may be of interest to some folks even though it only affects miniperl. > > So I'll push this soonish unless someone has a better suggestion. > > --- pp_hot.c.orig 2014-03-14 21:40:45 -0500 > +++ pp_hot.c 2014-03-15 18:00:37 -0500 > @@ -1698,7 +1698,11 @@ Perl_do_readline(pTHX) > } > } > for (t1 = SvPVX_const(sv); *t1; t1++) > +#ifdef __VMS > + if (strchr("*%?", *t1)) > +#else > if (strchr("$&*(){}[]'\";\\|?<>~`", *t1)) > +#endif > break; > if (*t1 && PerlLIO_lstat(SvPVX_const(sv), &PL_statbuf) < > 0) { > (void)POPs; /* Unmatched wildcard? Chuck > it... */
You did commit that code, but the ticket was never closed: ##### commit b51c3e77dbb7e510319342a73163b3fbb59baf5a Author: Craig A. Berry <craigberry@mac.com> AuthorDate: Fri Mar 21 19:29:38 2014 -0500 Commit: Craig A. Berry <craigberry@mac.com> CommitDate: Fri Mar 21 20:55:03 2014 -0500 ##### Closing now. -- James E Keenan (jkeenan@cpan.org)


This service is sponsored and maintained by Best Practical Solutions and runs on Perl.org infrastructure.

For issues related to this RT instance (aka "perlbug"), please contact perlbug-admin at perl.org