Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spurious lstat calls in Perl_do_readline / glob #13663

Closed
p5pRT opened this issue Mar 14, 2014 · 5 comments
Closed

spurious lstat calls in Perl_do_readline / glob #13663

p5pRT opened this issue Mar 14, 2014 · 5 comments

Comments

@p5pRT
Copy link

p5pRT commented Mar 14, 2014

Migrated from rt.perl.org#121440 (status was 'resolved')

Searchable as RT121440$

@p5pRT
Copy link
Author

p5pRT commented Mar 14, 2014

From @craigberry

In pp_hot.c in Perl_do_readline, each and every line returned by a glob operation is run through the following check​:

  for (t1 = SvPVX_const(sv); *t1; t1++)
  if (strchr("$&*(){}[]'\";\\|?<>~`", *t1))
  break;
  if (*t1 && PerlLIO_lstat(SvPVX_const(sv), &PL_statbuf) < 0) {
  (void)POPs; /* Unmatched wildcard? Chuck it... */
  continue;
  }

where sv contains the line returned by calling sv_gets on the file pointer returned by Perl_start_glob. The code is ancient; the basic logic hasn't changed since at least 5.000 twenty years ago.

Checking each and every character of each and every filename for its presence in a hard-coded list of pattern match characters is a fair amount of processing, but is also non-portable and, more seriously, causes many extra calls to the relatively slow lstat() in very common use cases on non-Unix platforms.

On Windows, any path with a backslash directory delimiter will trigger an unnecessary lstat(). On VMS, any native path with bracket directory delimiters ([]<>) or a semicolon version delimiter (;) will trigger an unnecessary lstat(). Also, on VMS, the dollar sign is a perfectly legal and extremely common filename character, but its presence will trigger an lstat().

I believe the intent of the code quoted above is to detect the case where the shell's glob returns the input pattern verbatim, indicating "no match." The lstat() appears to be a sanity check, as if to say, "not only does this file have pattern characters in its name, it also does not in fact exist."

If that is the case, it seems to me a better check would be to compare the first line returned from the glob (and the first line only) with the input pattern for an exact match and skip only that case. That wouldn't work if there are shells that normalize the input pattern in some way before returning it, but I don't know if that's ever the case.

As far as I know (which may not be very for, so please correct me) there would never be a reason to do this check on any but the first line of input (I *think* C<IoLINES(io) == 1> would do this but haven't checked). By the time you have two lines returned from a glob operation, you know you have at least two matches, so a check for the "no match" case no longer makes sense.

I don't know much about unixy glob implementations, so it's possible the foregoing analysis isn't correct, but I have stepped through and watched the extra lstat() calls, so there is a definite performance bug here as that's I/O to the file header that is completely unnecessary.

The following simple example shows that the mere presence of a version specification (;*) on the glob pattern causes a stat call on each filename returned. First with no debugging output​:

$ perl -we "print qq/$x\n/ while ($x = <*.com;*>);"
configure.com;1
extra_pods.com;1
perl_setup.com;1
vmspipe.com;1

and now with Perl debugger tracing enabled but also run under the VMS debugger to show where the internal stat routine gets called​:

$ dbgperl -Dt -we "print qq/$x\n/ while ($x = <*.com;*>);"
%DEBUG-W-DWNOT1PROC, the 1 process debugger cannot be run in DECwindows mode

  OpenVMS I64 Debug64 Version V8.4-001

%DEBUG-I-INITIAL, Language​: C, Module​: PERLMAIN
%DEBUG-I-NOTATMAIN, Type GO to reach MAIN program

DBG> go
break at routine PERLMAIN\main in THREAD 1
88751​: PL_use_safe_putenv = FALSE;
DBG> set image dbgperlshr
DBG> set module vms
DBG> set break Perl_flex_stat_int do (examine *fspec; go)
DBG> g

EXECUTING...

(-e​:0) enter
(-e​:0) nextstate
(-e​:1) enter
(-e​:1) const(PV("*.com;*"\0))
(-e​:1) gv(__ANON__​::)
(-e​:1) glob
break at routine VMS\Perl_flex_stat_int in THREAD 1
141001​: char *temp_fspec = NULL;
*VMS\Perl_flex_stat_int\fspec​: "*.com;*"
break at routine VMS\Perl_flex_stat_int in THREAD 1
141001​: char *temp_fspec = NULL;
*VMS\Perl_flex_stat_int\fspec​: "configure.com;1"
(-e​:1) gvsv(main​::x)
(-e​:1) sassign
(-e​:1) defined
(-e​:1) and
(-e​:1) pushmark
(-e​:1) gvsv(main​::x)
(-e​:1) const(PV("\n"\0))
(-e​:1) concat
(-e​:1) print
configure.com;1
(-e​:1) unstack
(-e​:1) const(PV("*.com;*"\0))
(-e​:1) gv(__ANON__​::)
(-e​:1) glob
break at routine VMS\Perl_flex_stat_int in THREAD 1
141001​: char *temp_fspec = NULL;
*VMS\Perl_flex_stat_int\fspec​: "extra_pods.com;1"
(-e​:1) gvsv(main​::x)
(-e​:1) sassign
(-e​:1) defined
(-e​:1) and
(-e​:1) pushmark
(-e​:1) gvsv(main​::x)
(-e​:1) const(PV("\n"\0))
(-e​:1) concat
(-e​:1) print
extra_pods.com;1
(-e​:1) unstack
(-e​:1) const(PV("*.com;*"\0))
(-e​:1) gv(__ANON__​::)
(-e​:1) glob
break at routine VMS\Perl_flex_stat_int in THREAD 1
141001​: char *temp_fspec = NULL;
*VMS\Perl_flex_stat_int\fspec​: "perl_setup.com;1"
(-e​:1) gvsv(main​::x)
(-e​:1) sassign
(-e​:1) defined
(-e​:1) and
(-e​:1) pushmark
(-e​:1) gvsv(main​::x)
(-e​:1) const(PV("\n"\0))
(-e​:1) concat
(-e​:1) print
perl_setup.com;1
(-e​:1) unstack
(-e​:1) const(PV("*.com;*"\0))
(-e​:1) gv(__ANON__​::)
(-e​:1) glob
break at routine VMS\Perl_flex_stat_int in THREAD 1
141001​: char *temp_fspec = NULL;
*VMS\Perl_flex_stat_int\fspec​: "vmspipe.com;1"
(-e​:1) gvsv(main​::x)
(-e​:1) sassign
(-e​:1) defined
(-e​:1) and
(-e​:1) pushmark
(-e​:1) gvsv(main​::x)
(-e​:1) const(PV("\n"\0))
(-e​:1) concat
(-e​:1) print
vmspipe.com;1
(-e​:1) unstack
(-e​:1) const(PV("*.com;*"\0))
(-e​:1) gv(__ANON__​::)
(-e​:1) glob
(-e​:1) gvsv(main​::x)
(-e​:1) sassign
(-e​:1) defined
(-e​:1) and
(-e​:1) leave
(-e​:1) leave
%DEBUG-I-EXITSTATUS, is '%SYSTEM-S-NORMAL, normal successful completion'
DBG> Exit

Thanks go to Hein van den Heuvel for making me aware of this problem.

$ perl -"V"
Summary of my perl5 (revision 5 version 19 subversion 10) configuration​:
  Snapshot of​: f53a6e0
  Platform​:
  osname=VMS, osvers=V8.4, archname=VMS_IA64-thread-multi
  uname='VMS alma V8.4 HP rx2600 (1.50GHz/6.0MB)'
  config_args='-"Dusedevel" -"DDEBUGGING" -"Dusethreads" -"Dusevmsdebug" -"des"'
  hint=none, useposix=false, d_sigaction=define
  useithreads=define, usemultiplicity=define
  use64bitint=undef, use64bitall=undef, uselongdouble=undef
  usemymalloc=undef, bincompat5005=undef
  Compiler​:
  cc='CC/DECC', ccflags ='/Include=[]/Standard=Relaxed_ANSI/Prefix=All/Obj=.obj /NOANSI_ALIAS/float=ieee/ieee=denorm/NAMES=(SHORTENED)/Define=_USE
_STD_STAT=1',
  optimize='/List/Debug/NoOpt',
  cppflags='undef'
  ccversion='70390020', gccversion='', gccosandvers='undef'
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries​:
  ld='Link/nodebug', ldflags ='/Debug/Trace/Map'
  libpth=/sys$share /sys$library
  libs=
  perllibs=
  libc=(DECCRTL), so=exe, useshrplib=true, libperl=undef
  gnulibc_version='undef'
  Dynamic Linking​:
  dlsrc=dl_vms.xs, dlext=exe, d_dlsymun=undef, ccdlflags=''
  cccdlflags='', lddlflags='/Share'

Characteristics of this PERLSHR image​:
  Compile-time options​: DEBUGGING HAS_TIMES HAVE_INTERP_INTERN MULTIPLICITY
  PERLIO_LAYERS PERL_DONT_CREATE_GVSV
  PERL_EXTERNAL_GLOB PERL_HASH_FUNC_ONE_AT_A_TIME_HARD
  PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP
  PERL_NEW_COPY_ON_WRITE PERL_PRESERVE_IVUV
  PERL_TRACK_MEMPOOL USE_IEEE USE_ITHREADS
  USE_LARGE_FILES USE_LOCALE USE_LOCALE_COLLATE
  USE_LOCALE_CTYPE USE_LOCALE_NUMERIC USE_PERLIO
  USE_PERL_ATOF USE_REENTRANT_API VMS_DO_SOCKETS
  VMS_SHORTEN_LONG_SYMBOLS
  Built under VMS
  Compiled at Mar 12 2014 09​:05​:41
  %ENV​:
  PERLSHR="perl_root​:[000000]perlshr.exe"
  PERL_ROOT="DSA0​:[craig.blead.]"
  @​INC​:
  /perl_root/lib/site_perl/VMS_IA64-thread-multi
  /perl_root/lib/site_perl
  /perl_root/lib/VMS_IA64-thread-multi/5_19_10
  /perl_root/lib
  .

________________________________________
Craig A. Berry
mailto​:craigberry@​mac.com

"... getting out of a sonnet is much more
difficult than getting in."
  Brad Leithauser

@p5pRT
Copy link
Author

p5pRT commented Mar 17, 2014

From @craigberry

On 3/14/14, 3​:51 PM, Craig A . Berry wrote​:

# New Ticket Created by Craig A. Berry
# Please include the string​: [perl #121440]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121440 >

In pp_hot.c in Perl_do_readline, each and every line returned by a
glob operation is run through the following check​:

         for \(t1 = SvPVX\_const\(sv\); \*t1; t1\+\+\)
             if \(strchr\("$&\*\(\)\{\}\[\]'\\";\\\\|?\<>~\`"\, \*t1\)\)
                     break;
         if \(\*t1 && PerlLIO\_lstat\(SvPVX\_const\(sv\)\, &PL\_statbuf\) \< 0\) \{
             \(void\)POPs;             /\* Unmatched wildcard?  Chuck it\.\.\. \*/
             continue;
         \}

where sv contains the line returned by calling sv_gets on the file
pointer returned by Perl_start_glob. The code is ancient; the basic
logic hasn't changed since at least 5.000 twenty years ago.

Checking each and every character of each and every filename for its
presence in a hard-coded list of pattern match characters is a fair
amount of processing, but is also non-portable and, more seriously,
causes many extra calls to the relatively slow lstat() in very common
use cases on non-Unix platforms.

What I forgot about when I posted is that this code only kicks in when
PERL_EXTERNAL_GLOB is defined, which on most platforms means only for
miniperl but on VMS means always.

As far as I know (which may not be very for, so please correct me)
there would never be a reason to do this check on any but the first
line of input (I *think* C<IoLINES(io) == 1> would do this but
haven't checked). By the time you have two lines returned from a
glob operation, you know you have at least two matches, so a check
for the "no match" case no longer makes sense.

Answering my own question, I believe the expectation is that multiple
glob patterns can be processed in sequence, which means an unexpanded
wildcard could be found anywhere in the input, such is the *.flirble in
the following​:

$ echo *.lst *.flirble *.SH
mkppport.lst utils.lst *.flirble Makefile.SH Policy_sh.SH cflags.SH
config_h.SH makedepend.SH metaconfig.SH myconfig.SH runtests.SH

So this can't be fixed in the universal fashion I was hoping. The patch
below fixes it on VMS by only checking for wildcard characters that are
meaningful in the VMS glob implementation, thus saving the lstat() call
on all the perfectly normal filenames having brackets, semicolons, and
dollar signs.

This makes a glob of the Perl source tree 60% faster on first iteration
and 80% faster on subsequent iterations (presumably because caching of
directories doesn't get swamped out by all the extra lstat() calls). It
doesn't do anything for paths containing backslashes on Windows, which
may be of interest to some folks even though it only affects miniperl.

So I'll push this soonish unless someone has a better suggestion.

Inline Patch
--- pp_hot.c.orig	2014-03-14 21:40:45 -0500
+++ pp_hot.c	2014-03-15 18:00:37 -0500
@@ -1698,7 +1698,11 @@ Perl_do_readline(pTHX)
  		}
  	    }
  	    for (t1 = SvPVX_const(sv); *t1; t1++)
+#ifdef __VMS
+		if (strchr("*%?", *t1))
+#else
  		if (strchr("$&*(){}[]'\";\\|?<>~`", *t1))
+#endif
  			break;
  	    if (*t1 && PerlLIO_lstat(SvPVX_const(sv), &PL_statbuf) < 0) {
  		(void)POPs;		/* Unmatched wildcard?  Chuck it... */

@p5pRT
Copy link
Author

p5pRT commented Sep 30, 2017

From @jkeenan

On Mon, 17 Mar 2014 12​:33​:58 GMT, craigberry wrote​:

On 3/14/14, 3​:51 PM, Craig A . Berry wrote​:

# New Ticket Created by Craig A. Berry
# Please include the string​: [perl #121440]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=121440 >

In pp_hot.c in Perl_do_readline, each and every line returned by a
glob operation is run through the following check​:

for (t1 = SvPVX_const(sv); *t1; t1++)
if (strchr("$&*(){}[]'\";\\|?<>~`", *t1))
break;
if (*t1 && PerlLIO_lstat(SvPVX_const(sv), &PL_statbuf) < 0) {
(void)POPs; /* Unmatched wildcard? Chuck it... */
continue;
}

where sv contains the line returned by calling sv_gets on the file
pointer returned by Perl_start_glob. The code is ancient; the basic
logic hasn't changed since at least 5.000 twenty years ago.

Checking each and every character of each and every filename for its
presence in a hard-coded list of pattern match characters is a fair
amount of processing, but is also non-portable and, more seriously,
causes many extra calls to the relatively slow lstat() in very common
use cases on non-Unix platforms.

What I forgot about when I posted is that this code only kicks in when
PERL_EXTERNAL_GLOB is defined, which on most platforms means only for
miniperl but on VMS means always.

As far as I know (which may not be very for, so please correct me)
there would never be a reason to do this check on any but the first
line of input (I *think* C<IoLINES(io) == 1> would do this but
haven't checked). By the time you have two lines returned from a
glob operation, you know you have at least two matches, so a check
for the "no match" case no longer makes sense.

Answering my own question, I believe the expectation is that multiple
glob patterns can be processed in sequence, which means an unexpanded
wildcard could be found anywhere in the input, such is the *.flirble
in
the following​:

$ echo *.lst *.flirble *.SH
mkppport.lst utils.lst *.flirble Makefile.SH Policy_sh.SH cflags.SH
config_h.SH makedepend.SH metaconfig.SH myconfig.SH runtests.SH

So this can't be fixed in the universal fashion I was hoping. The
patch
below fixes it on VMS by only checking for wildcard characters that
are
meaningful in the VMS glob implementation, thus saving the lstat()
call
on all the perfectly normal filenames having brackets, semicolons, and
dollar signs.

This makes a glob of the Perl source tree 60% faster on first
iteration
and 80% faster on subsequent iterations (presumably because caching of
directories doesn't get swamped out by all the extra lstat() calls).
It
doesn't do anything for paths containing backslashes on Windows, which
may be of interest to some folks even though it only affects miniperl.

So I'll push this soonish unless someone has a better suggestion.

--- pp_hot.c.orig 2014-03-14 21​:40​:45 -0500
+++ pp_hot.c 2014-03-15 18​:00​:37 -0500
@​@​ -1698,7 +1698,11 @​@​ Perl_do_readline(pTHX)
}
}
for (t1 = SvPVX_const(sv); *t1; t1++)
+#ifdef __VMS
+ if (strchr("*%?", *t1))
+#else
if (strchr("$&*(){}[]'\";\\|?<>~`", *t1))
+#endif
break;
if (*t1 && PerlLIO_lstat(SvPVX_const(sv), &PL_statbuf) <
0) {
(void)POPs; /* Unmatched wildcard? Chuck
it... */

You did commit that code, but the ticket was never closed​:

#####
commit b51c3e7
Author​: Craig A. Berry <craigberry@​mac.com>
AuthorDate​: Fri Mar 21 19​:29​:38 2014 -0500
Commit​: Craig A. Berry <craigberry@​mac.com>
CommitDate​: Fri Mar 21 20​:55​:03 2014 -0500
#####

Closing now.
--
James E Keenan (jkeenan@​cpan.org)

@p5pRT
Copy link
Author

p5pRT commented Sep 30, 2017

The RT System itself - Status changed from 'new' to 'open'

@p5pRT p5pRT closed this as completed Sep 30, 2017
@p5pRT
Copy link
Author

p5pRT commented Sep 30, 2017

@jkeenan - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant