FW: Call Nr. 6583771 (!!! HELP !!!) #675

Closed
p5pRT opened this issue Oct 4, 1999 · 17 comments
Comments

@p5pRT

p5pRT commented Oct 4, 1999

Migrated from rt.perl.org#1564 (status was 'resolved')

Searchable as RT1564$

@p5pRT

p5pRT commented Oct 4, 1999

From Richard.Hensgens@nl.origin-it.com

L.S.,

We have encountered a very interesting problem for which you are really our
last resort:

Exiting a child in a forking server (example on page 194 of 'Advanced Perl
Programming', O'Reilly) seems to clean up the server socket of the parent on
newer levels of Solaris. The parent exits with an 'EBADF (Bad file number)'
after having served one client request.

We have tried almost everything within our power, e.g.:

* compiling Perl on a working OS level and copying the binaries to the
non-working OS level,
* compiling the current development version (5.005_61),
* different GNU compilers (2.8.1 and 2.95.1),
* SUN Workshop Compiler C/C++ 4.2,
* hacking in 'config.sh' (e.g. 'usevfork=false/true',
multithreaded/non-multithreaded).

Nothing works out.

After we filed a bug report, SUN responded with the following:
<<FW: Bug ID# 4146098>>
but this is too low-level for us to understand what's really going on. The
troublesome patch from SUN seems to be 105210-17 or above.

In more understandable language, they claimed that older versions of Solaris
had a bug which is fixed in newer releases, and that Perl has probably been
working around that bug. Now that the bug has been removed from the OS, Perl is
still working around it, but this time unsuccessfully.

Does this make sense to you ?

Can you help ???

P.S.: Below you can find all mail communications with SUN.
  If you need additional information, please let us know.

Met vriendelijke groet/Kind regards,
Richard Hensgens
ORIGIN B.V. - Managed Services - Distributed Systems
Building VA-171, E-Mail: Richard.Hensgens@nl.origin-it.com
Phone: (+31:4027)87097, Fax: (+31:4027)83962

The unix guru's view on sex​:
# unzip; strip; touch; finger; mount; fsck; more; yes; umount; sleep

-----Original Message-----
From: Zuijdwijk, Pieter
Sent: Thursday, September 30, 1999 6:44 PM
To: 'dispatch@holland.sun.com'
Cc: Zuijdwijk, Pieter; Hensgens, Richard
Subject: Call Nr. 6583771

Attached is the "truss -aef" output of two SUN systems running two different OS
levels:

OK files:  SunOS ... 5.6 Generic_105181-06 sun4u sparc SUNW,Ultra-4
NOK files: SunOS ... 5.6 Generic_105181-15 sun4u sparc SUNW,Ultra-Enterprise

<<Client.truss.out.NOK>> <<Client.truss.out.OK>> <<Client.pl>>
<<Server.pl>> <<Server.truss.out.NOK>> <<Server.truss.out.OK>>

As you can see, we also have problems on 5.6 Generic_105181-15 on
Ultra-Enterprise 3000, so this is not a Solaris 7-specific issue after all.

Thanks in advance.

Pieter Zuijdwijk
Origin TIS-DS-UNIX-SUN
Groenewoudseweg 1
5621 BA Eindhoven, The Netherlands
Building VA-169
Phone +31 (0)40 27 89605
Fax +31 (0)40 27 89362

-----Original Message-----
From: Hensgens, Richard
Sent: Tuesday, September 28, 1999 1:09 PM
To: Zuijdwijk, Pieter
Subject: Bug Solaris 2.7

Pieter,

Before we start downgrading the SUN box, should we first file a bug report with SUN?

The standard examples from the O'Reilly Perl books behave differently on Solaris
2.6 and Solaris 2.7 with exactly the same Perl version (5.005_03):

Server.pl:
#!/usr/bin/perl

use IO::Socket;

$SIG{CHLD} = sub { wait() };

$Sock = new IO::Socket::INET( LocalPort => 9000, Proto => 'tcp', Listen =>
SOMAXCONN, Reuse => 1 ) or die "SOCKET() error [$!]";

while ( $NewSock = $Sock->accept() )
{
  $Pid = fork();

  if ( $Pid == 0 )
  {
    while ( defined( $Buffer = <$NewSock> ) )
    {
      print( $Buffer );
    }

    exit( 0 );
  }
}

close( $Sock );

exit( 0 );

Client.pl:
#!/usr/bin/perl

use IO::Socket;

$Sock = new IO::Socket::INET( PeerAddr => 'tsesun01', PeerPort => 9000,
Proto => 'tcp' ) or die "SOCKET() error [$!]";

foreach ( 1..10 )
{
  print( $Sock "Msg $_: How are you ?\n" );
}

close( $Sock );

exit( 0 );

Output on Solaris 2.6:

nl1sahd1:root> ./Server.pl
nl1sahd1:root> jobs
[1] + Running ./Server.pl &
nl1sahd1:root> ./Client.pl
nl1sahd1:root> Msg 1: How are you ?
Msg 2: How are you ?
Msg 3: How are you ?
Msg 4: How are you ?
Msg 5: How are you ?
Msg 6: How are you ?
Msg 7: How are you ?
Msg 8: How are you ?
Msg 9: How are you ?
Msg 10: How are you ?

nl1sahd1:root> jobs
[1] + Running ./Server.pl &

The server serves as many requests as it should.

Output on Solaris 2.7:

tsesun01:root> ./Server.pl &
[1] 12331
tsesun01:root> jobs
[1] + Running ./Server.pl &
tsesun01:root> ./Client.pl
Msg 1: How are you ?
Msg 2: How are you ?
tsesun01:root> Msg 3: How are you ?
Msg 4: How are you ?
Msg 5: How are you ?
Msg 6: How are you ?
Msg 7: How are you ?
Msg 8: How are you ?
Msg 9: How are you ?
Msg 10: How are you ?

[1] + Done ./Server.pl &
tsesun01:root> jobs

The server serves only one request and then exits !!!!!

@p5pRT

p5pRT commented Oct 4, 1999

From Richard.Hensgens@nl.origin-it.com

Message RFC822:
Message-ID: 986AEA765305D311AA7B0008C75D97AFBB8A23@NLEHX020.origimail.origin-it.com
From: "Hensgens, Richard" Richard.Hensgens@nl.origin-it.com
To: "Hensgens, Richard" Richard.Hensgens@nl.origin-it.com
Subject: FW: Bug ID# 4146098
Date: Mon, 4 Oct 1999 17:19:14 +0200
MIME-Version: 1.0
X-Mailer: Internet Mail Service (5.5.2448.0)
Content-Type: text/plain;
charset="iso-8859-1"

Bug Id: 4146098
Product: sunos
Category: network
Subcategory: socket
Bug/Rfe/Eou: bug
State: integrated
Development Status: INT
Synopsis: connect() and accept() can RESTART instead of returning EINTR
Keywords: esc#514623
Severity: 2
Severity Impact: 1
Severity Functionality: 0
Priority: 2
Description:
When SA_RESTART is passed to sigaction(), connect() and accept() restart
instead of returning with errno EINTR.

CONNECT(2) SYSTEM CALLS CONNECT(2)

EINTR               The connection attempt  was  interrupted
                     before  any data arrived by the delivery
                     of a signal.

Sun Release 4.1 Last change: 21 January 1990 3

============================================================================

SVR4 example sigaction() is needed to set SA_RESTART.
c_test_sys5
is selected that will not respond to connect.
kill -ALARM SVR4 example sigaction() is needed to set
SA_RESTART.

is sent to process while waiting.

/*
cc -o c_test_sys5 c_test_sys5.c -lsocket -lnsl

 Usage: c_test_sys5 <hostname> <portNO>

*/
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <netinet/in.h>
#include <signal.h>

void handler(sig)
int sig;
{
printf("SIGNAL CATCHED\n");
}

main(argc, argv)
int argc;
char **argv;
{

    struct sockaddr_in      ser;
    struct hostent          *serhost;
    int             sock;
    int             n;
    char            buf[256];
    struct sigaction sa;

    if(argc != 3){
            fprintf(stderr, "Usage: client <hostname-of-server>\n");
            exit(1);
    }

    sa.sa_flags = SA_RESTART;
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    serhost = gethostbyname(argv[1]);
    if(serhost == NULL){
            fprintf(stderr, "bad hostname\n");
            exit(1);
    }

    memset((char *)&ser, 0, sizeof(ser));
    ser.sin_family = AF_INET;
    ser.sin_port = atoi(argv[2]);
    memcpy(&ser.sin_addr, serhost->h_addr, serhost->h_length);


    sock = socket(AF_INET, SOCK_STREAM, 0);
    if(sock == -1){
            fprintf(stderr, "socket failed\n");
            exit(1);
    }


    if(connect(sock, (struct sockaddr *)&ser, sizeof(ser)) == -1){
            perror("CONNECT");
            exit(1);
    }


    while(1){
                    n = read(sock, buf, sizeof(buf));
            if(n == 0)
                    break;
            if(n < 0){
                    fprintf(stderr, "file read error\n");
                    exit(1);
            }
            write(1, buf, n);
    }

    close(sock);
    exit(0);

}

Justification:
This is the root cause of Escalation # 514623
bug# 4132657, Customer needs a patch for 5.6.
Work around:

Suggested fix:

Diffs are shown below for sparc and x86 (diffs are identical for
sparc and sparcv9). The entire set of files changed are:

    usr/src/lib/libc/i386/sys/_so_accept.s
    usr/src/lib/libc/i386/sys/_so_connect.s
    usr/src/lib/libc/sparc/sys/_so_accept.s
    usr/src/lib/libc/sparc/sys/_so_connect.s
    usr/src/lib/libc/sparcv9/sys/_so_accept.s
    usr/src/lib/libc/sparcv9/sys/_so_connect.s

note: _cerror() maps ERESTART to EINTR

####### usr/src/lib/libc/sparc/sys ######

% diff -c _so_connect.s.1.2 _so_connect.s
*** _so_connect.s.1.2 Thu May 22 14:38:48 1997
--- _so_connect.s Fri Jun 5 08:28:00 1998


*** 18,24 ****

#include "SYS.h"

! SYSCALL2_RESTART(_so_connect,connect)
RET

    SET_SIZE(_so_connect)

--- 18,24 ----

#include "SYS.h"

! SYSCALL2(_so_connect,connect)
RET

    SET_SIZE(_so_connect)

% diff -c _so_accept.s.1.2 _so_accept.s
*** _so_accept.s.1.2 Thu May 22 14:38:48 1997
--- _so_accept.s Fri Jun 5 08:27:12 1998


*** 19,25 ****

#include "SYS.h"

! SYSCALL2_RESTART(_so_accept,accept)
RET

    SET_SIZE(_so_accept)

--- 19,25 ----

#include "SYS.h"

! SYSCALL2(_so_accept,accept)
RET

    SET_SIZE(_so_accept)

sctesrv 54:

##################### usr/src/lib/libc/i386/sys

% diff -c _so_connect.s.1.5 _so_connect.s
*** _so_connect.s.1.5 Fri Jun 5 08:59:52 1998
--- _so_connect.s Fri Jun 5 09:01:48 1998


*** 18,25 ****
        movl    $CONNECT,%eax
        lcall   $SYSCALL_TRAPNUM,$0
        jae     noerror
-       cmpb    $ERESTART,%al
-       je      _so_connect
        _prologue_
  m4_ifdef(`DSHLIB', `pushl %eax',
--- 18,23 ----
sctesrv 43: diff -c _so_accept.s.1.5
diff: two filename arguments required
sctesrv 44: diff -c _so_accept.s.1.5 _so_accept.s
*** _so_accept.s.1.5 Fri Jun 5 09:02:12 1998
--- _so_accept.s Fri Jun 5 09:02:53 1998

*** 18,25 ****
        movl    $ACCEPT,%eax
        lcall   $SYSCALL_TRAPNUM,$0
        jae     noerror
-       cmpb    $ERESTART,%al
-       je      _so_accept
        _prologue_
  m4_ifdef(`DSHLIB', `pushl %eax',
--- 18,23 ----

State triggers:
Accepted: yes
Evaluated: yes
Evaluation:
4132657 covers the binary compatibility problem. When the sample program
from
4132657 is compiled and tested on 5.6, the result is:

    $ /ws/on297-tools/SUNWspro/SC4.2/bin/cc x.c -lsocket -lnsl
    $ ./x fade 15000 &
    18519
    $ kill -ALRM 18519
    $ SIGNAL CATCHED
    CONNECT: Interrupted system call

    $ wait

That is, the behavior is correct. Thus the only problem appears to be the
BCP one and this one isn't reproducible.

=================================
updated description with reproducible example

1998-06-16

1998-07-20

I thought this fix was being done as part of the escalation process
(4132567 and this are essentially the same bug for pre-kernel socket and
post-kernel socket source bases...not sure why they got split into
two bugs. The fixes are different because of different sources, bugs
are not). Will try to fix and test this. The code in Suggested Fix
should work.

1998-07-23

My guess is that this bug got split into 2.5.1 and 2.6-and-later versions
since this might be not-quite-easily fixable for 2.5.1 since that would
involve changing the restartable nature of getmsg()/putmsg() system calls.

The fix here is to make the system calls underlying calls for
connect() and accept() interfaces NOT restartable as they currently
(and erroneously) are. This makes the behavior compatible to SunOS4.x
and also fixes it for SunOS5.x [ The BCP interfaces are implemented
using the native OS interfaces, a BCP program just happens to have
uncovered this bug ]. The emails in the "Comments" section
further clarify some of the technical background behind this fix.

The program in the description section tests only the connect() interface.
Test programs with slight modifications were used to test both the
connect() and accept() interfaces and those test programs have been
added to the attachments.

WITHOUT THE FIX, the observed behavior is as follows with output slightly
edited for clarity:

===
% ./accept_test 1234 &
[1] 668
Process id is 668
% truss -v all -p 668
accept(3, 0xEFFFF9D4, 0xEFFFF9C0, 1) (sleeping...)
^C% kill -ALRM 668
SIGNAL CATCHED
% truss -v all -p 668
accept(3, 0xEFFFF9D4, 0xEFFFF9C0, 1) (sleeping...)

Thus accept() call is restarted and continues sleeping.

    % ./connect_test bobo 1234 &
    [1] 671
    Process id is 671
    % kill -ALRM 671
    % SIGNAL CATCHED
    CONNECT: Operation already in progress
    [1]    Exit 1               ./connect_test bobo 1234

The connect() call is restarted and fails with EINPROGRESS

====

WITH THE FIX, the observed behavior is as follows with output slightly
edited for clarity:

===
% ./accept_test 1234 &
[1] 4523
Process id is 4523
% kill -ALRM 4523
% SIGNAL CATCHED
ACCEPT: Interrupted system call
[1] Exit 1 ./accept_test 1234

The accept() call now fails with EINTR even when SA_RESTART is set.

    % ./connect_test bobo 1234 &
    [1] 4525
    Process id is 4525
    % kill -ALRM 4525
    % SIGNAL CATCHED
    CONNECT: Interrupted system call
    [1]    Exit 1               ./connect_test bobo 1234

The connect() call now fails with EINTR even when SA_RESTART is set.

====

    Commit to fix in releases: generic, s998_20
    Fixed in releases: s998_20
    Integrated in releases: s998_20
    Verified in releases: 
    Closed because: 
    Incomplete because: 

Duplicate of:
Introduced in Release:
Root cause:
Program management:
Fix affects documentation: no
Exempt from dev rel: no
Fix affects L10N: no

Patch id:
Comments:

added sys5 example to description and reopened bug.

1998-06-16

1998-07-20

An archive of two emails which are part of discussions relevant to this bug
which also point to a man page deficiency.

=========

Roger,

Jim seems to claim that all system calls except connect() were
automatically
restarted after a signal in SunOS 4.X. Is this really true i.e. did 4.X
have different restart semantics for different system calls?
(I figured asking you would be quicker than reading the 4.x source.)

My assumption is that in 5.X SA_RESTART should/must apply to all
interruptible system calls i.e. that we should not treat connect()
differently. Correct?

Note that connect() is odd because it can fail with EINTR/ERESTART after
having started the connect attempt. Thus when connect() is restarted
the 2nd one might fail with EALREADY or EISCONN even though connect
was successful. I don't know of any other syscalls that modify "state"
before returning EINTR.

4.x never restarted anything other than what 5.x does with
SA_RESTART passed to sigaction(). SA_RESTART does not mean
that all interruptible system calls are restarted. Only a subset.
This is what the man page for sigaction(2) says.
This is also true of 4.x:

 SA_RESTART          If set and the signal is caught, certain
                     functions  that  are  interrupted by the
                     execution of this signal's  handler  are
                     transparently  restarted  by the system;
                     namely, read(2) or write(2) on slow dev-
                     ices like terminals, ioctl(2), fcntl(2),
                     wait(2), and waitid(2).  Otherwise, that
                     function returns an EINTR error.

Roger

MIME-Version: 1.0

Thanks for the info.

4.x never restarted anything other than what 5.x does with
SA_RESTART passed to sigaction(). SA_RESTART does not mean
that all interruptible system calls are restarted. Only a subset.
This is what the man page for sigaction(2) says.
This is also true of 4.x:

 SA_RESTART          If set and the signal is caught, certain
                     functions  that  are  interrupted by the
                     execution of this signal's  handler  are
                     transparently  restarted  by the system;
                     namely, read(2) or write(2) on slow dev-
                     ices like terminals, ioctl(2), fcntl(2),
                     wait(2), and waitid(2).  Otherwise, that
                     function returns an EINTR error.

The above man page doesn't take sockets into account.
In 4.X the source code tells me that restart also applies to
send, sendto, sendmsg, recv, recvmsg, recvfrom.

Thus we clearly need to fix the man page to say that for SA_RESTART.
But what about getmsg and putmsg on slow devices?
Shouldn't they get the same treatment as read/write/send*/recv*?

The SunOS 5.6 source code shows the following ERESTARTs:
fcntl
getmsg getpmsg          NOT in man page
putmsg putpmsg          NOT in man page
ioctl
read
pread readv             NOT in man page
write
pwrite writev           NOT in man page
wait waitid
connect accept          THIS is a bug
recv recvfrom recvmsg   NOT in man page
send sendto sendmsg     NOT in man page

Thus the man page had it right 6 out of 14 prior to kernel sockets and
6 out of 20 with kernel sockets!!!

Tim, assuming Roger doesn't have an issue with documenting all 20, can you
file a man page bug to have the 14 missing calls added to the SA_RESTART
description.

Also, fix the connect and accept wrappers in libc (sparc and x86) to not
use the restart macro/code. That will fix this "BCP problem".

Erik

================

See also: 4132657
History:
Submitter: wadej Date: Jun 5 1998 10:02AM
Dispatch operator: bugtraq Date: Jun 5 1998 10:02AM
Acceptor: cs Date: Jun 11 1998 1:19PM
Evaluator: cs Date: Jun 11 1998 1:19PM
Commit operator: mukesh Date: Jul 27 1998 5:14PM
Fix operator: mukesh Date: Jul 27 1998 5:14PM
Integrating operator: bmc Date: Jul 28 1998 12:24PM
Verify operator: Date:
Closeout operator: Date:
Called in by:

@p5pRT

p5pRT commented Dec 21, 2000

From [Unknown Contact. See original ticket]

This bug still seems to be present in 5.7.0@8221, only on Solaris.

-spp

@p5pRT

p5pRT commented Dec 21, 2000

From [Unknown Contact. See original ticket]

Perhaps on Solaris 2.7 a shutdown is being performed on the socket when the
child closes. As an experiment/workaround, try explicitly closing the listening
socket in the child, as shown below.

"Stephen P. Potter" wrote:

Server.pl:
#!/usr/bin/perl

use IO::Socket;

$SIG{CHLD} = sub { wait() };

$Sock = new IO::Socket::INET( LocalPort => 9000, Proto => 'tcp', Listen =>
SOMAXCONN, Reuse => 1 ) or die "SOCKET() error [$!]";

while ( $NewSock = $Sock->accept() )
{
  $Pid = fork();

  if ( $Pid == 0 )
  {
    # added: close the child's copy of the listening socket
    close( $Sock ) and $sockClosed = 1;

    while ( defined( $Buffer = <$NewSock> ) )
    {
      print( $Buffer );
    }

    exit( 0 );
  }
}

# changed: was an unconditional close( $Sock );
close( $Sock ) unless $sockClosed;

@p5pRT

p5pRT commented Dec 21, 2000

From [Unknown Contact. See original ticket]

Lightning flashed, thunder crashed and ___cliff rayman___ <cliff@genwax.com>
whispered:
| perhaps on solaris 2.7 a shutdown is being performed on the socket when the
| child closes. as an 'experiment/work around', try specifically close the
| listening socket in the child as per below.

I think the point being made in this report is that the script functions
differently between Solaris versions. Sun claims to have fixed a bug that
we may have been working around, and that workaround may no longer be
needed and may be causing the problem.

-spp

@p5pRT

p5pRT commented Dec 21, 2000

From @AlanBurlison

"Stephen P. Potter" wrote:

> We have encountered a very interesting problem on which you are really our
> last resort:
>
> Exiting a child in a forking server (example on page 194 of 'Advanced Perl
> Programming' O'Reilly) seems to clean-up the server socket of the parent on
> newer levels of Solaris. The parent exits with a 'EBADF (Bad file number)'
> after having served one client request.
>
> We have tried almost everything within our power, e.g.:
>
> * compiling Perl on a working OS level and copying the binaries to
>   the non-working OS level,
> * compiling the current development version (5.005_61),
> * different GNU compilers (2.8.1 and 2.95.1),
> * SUN Workshop Compiler C/C++ 4.2,
> * hacking in 'config.sh' (e.g. 'usevfork=false/true',
>   multithreaded/non-multithreaded).
>
> Nothing works out.

Right - I've read the bugrep, played with the example code and here is
the story. Prior to the fix, accept() and connect() were erroneously
being restarted when a signal was caught. The correct behaviour
according to the SVR4 spec is for them to return with EINTR, even if
SA_RESTART has been passed to sigaction().

The bugfix changed the behaviour so that if a signal was caught when
either accept() or connect() are in progress they fail with EINTR
instead of being restarted.

There are two ways to fix the example script. The first is to redo the
accept() if EINTR is returned. The problem with this approach is that
the IO​::Socket library doesn't check the return value of the accept()
call, and then tries to do some I/O ops [llseek()] on the invalid file
handle. This then means that by the time your script can get hold of
errno it is set to EBADF instead of EINTR.

The quick and easy fix is to ignore SIGCHLD rather than catching it -
this way no zombie child processes are created and no signals are
generated to screw up the accept() call. Change the line
  $SIG{CHLD} = sub { wait() };
to
  $SIG{CHLD} = 'IGNORE';
and the script then works as expected.
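
For readers following the C side of the bug report, the same idea can be sketched with sigaction(). This is a minimal sketch under assumptions: `ignore_sigchld` is a hypothetical helper name, and the claim that ignored SIGCHLD leaves no zombies holds on systems that follow the XSI/POSIX.1-2001 behaviour, which should be checked for the platform in question.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical helper: rough C counterpart of $SIG{CHLD} = 'IGNORE'.
 * With SIGCHLD ignored, no handler runs, so a blocking accept() in the
 * parent is not interrupted when a child exits. */
int ignore_sigchld(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);

    return sigaction(SIGCHLD, &sa, NULL); /* 0 on success, -1 on error */
}
```

The trade-off is that the parent can no longer collect the exit status of its children, which is acceptable for the example server because it never looks at it.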

Hope that helps,

Alan Burlison
Solaris Kernel Development, Sun Microsystems

@p5pRT

p5pRT commented Dec 21, 2000

From @jhi

> There are two ways to fix the example script. The first is to redo the
> accept() if EINTR is returned. The problem with this approach is that
> the IO::Socket library doesn't check the return value of the accept()
> call, and then tries to do some I/O ops [llseek()] on the invalid file
> handle. This then means that by the time your script can get hold of
> errno it is set to EBADF instead of EINTR.
>
> The quick and easy fix is to ignore SIGCHLD rather than catching it -

Can I still fix IO::Socket? :-)

> this way no zombie child processes are created and no signals are
> generated to screw up the accept() call. Change the line
>   $SIG{CHLD} = sub { wait() };
> to
>   $SIG{CHLD} = 'IGNORE';
> And the script then works as expected.
>
> Hope that helps,
>
> Alan Burlison
> Solaris Kernel Development, Sun Microsystems

@p5pRT

p5pRT commented Dec 21, 2000

From @AlanBurlison

Jarkko Hietaniemi wrote:

> > There are two ways to fix the example script. The first is to redo the
> > accept() if EINTR is returned. The problem with this approach is that
> > the IO::Socket library doesn't check the return value of the accept()
> > call, and then tries to do some I/O ops [llseek()] on the invalid file
> > handle. This then means that by the time your script can get hold of
> > errno it is set to EBADF instead of EINTR.
> >
> > The quick and easy fix is to ignore SIGCHLD rather than catching it -
>
> Can I still fix IO::Socket? :-)

Hey, you're the main man...

:-)

Actually I was surmising from the truss output that the problem was in
IO::Socket. I had a quick look and it doesn't seem to be doing anything
naughty. I've had a look at pp_sys.c as well, and I can't see it there
either. Hmmm, wonder what is doing it?

Alan Burlison

@p5pRT

p5pRT commented Dec 21, 2000

From @gbarr

On Fri, Dec 22, 2000 at 12:09:40AM +0000, Alan Burlison wrote:

> Jarkko Hietaniemi wrote:
>
> > > There are two ways to fix the example script. The first is to redo the
> > > accept() if EINTR is returned. The problem with this approach is that
> > > the IO::Socket library doesn't check the return value of the accept()
> > > call, and then tries to do some I/O ops [llseek()] on the invalid file
> > > handle. This then means that by the time your script can get hold of
> > > errno it is set to EBADF instead of EINTR.
> > >
> > > The quick and easy fix is to ignore SIGCHLD rather than catching it -
> >
> > Can I still fix IO::Socket? :-)
>
> Hey, you're the main man...
>
> :-)
>
> Actually I was surmising from the truss output that the problem was in
> IO::Socket. I had a quick look and it doesn't seem to be doing anything
> naughty. I've had a look at pp_sys.c as well, and I can't see it there
> either. Hmmm, wonder what is doing it?

It may be something along the lines that IO​::Socket​::accept creates
a new object which gets destroyed when the method exits with an error.
And during that destroy process various calls may be made I suppose.

If this is the case, changing the return to something like the following may help

  $peer = accept($new,$sock)
  or do { local $!; undef $new; return };
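
The idea behind `local $!` here is to keep the error from accept() from being overwritten by whatever runs during cleanup. The same pattern can be sketched in C; this is an analogy under assumptions, with `close_keep_errno` a hypothetical helper name, and it is not the code Perl or IO::Socket actually uses.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: run a cleanup call that may itself fail
 * (for example with EBADF) without losing the errno value that
 * described the original failure (for example EINTR). */
int close_keep_errno(int fd)
{
    int saved_errno = errno;   /* remember the original error */
    int rc = close(fd);        /* may overwrite errno */

    errno = saved_errno;       /* restore it for the caller */
    return rc;
}
```

With this pattern a caller that sees accept() fail can clean up first and still test `errno == EINTR` afterwards.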

Graham.

@p5pRT

p5pRT commented Dec 21, 2000

From [Unknown Contact. See original ticket]

Lightning flashed, thunder crashed and Alan Burlison <Alan.Burlison@uk.sun.com>
whispered:
| There are two ways to fix the example script. The first is to redo the
| accept() if EINTR is returned. The problem with this approach is that
| the IO::Socket library doesn't check the return value of the accept()
| call, and then tries to do some I/O ops [llseek()] on the invalid file
| handle. This then means that by the time your script can get hold of
| errno it is set to EBADF instead of EINTR.

What I'm getting from all this is that there isn't a perceived bug in perl,
so I should go ahead and close the ticket. Is that correct? How do I
explain that the script works as the user expects on other OSes (and
earlier versions of Solaris)? A bug in those other OSes?

-spp

@p5pRT

p5pRT commented Dec 22, 2000

From [Unknown Contact. See original ticket]

Alan Burlison <Alan.Burlison@uk.sun.com> writes:

> The bugfix changed the behaviour so that if a signal was caught when
> either accept() or connect() are in progress they fail with EINTR
> instead of being restarted.
>
> There are two ways to fix the example script. The first is to redo the
> accept() if EINTR is returned. The problem with this approach is that
> the IO::Socket library doesn't check the return value of the accept()
> call,

So we can consider this a bug in IO::Socket.

@p5pRT

p5pRT commented Dec 22, 2000

From @AlanBurlison

> What I'm getting from all this is that there isn't a perceived bug in perl,
> so I should go ahead and close the ticket. Is that correct? How do I
> explain that the script works as the user expects on other OSes (and
> earlier versions of Solaris)? A bug in those other OSes?

Correct - there is no bug in perl (well, perhaps it should return EINTR
instead of EBADF...). I've tried to track down exactly which standard
mandates this behaviour, but without a lot of success. Signals are one
of the areas where different Unixes tend to differ wildly, and this
particular problem is a manifestation of those differences rather than a
bug per se - the behaviour will depend on which standards a particular
Unix is based on, and how closely it adheres to those standards.

The sigaction manpage for Solaris says this:

  SA_RESTART
  If set and the signal is caught, functions that are
  interrupted by the execution of this signal's handler
  are transparently restarted by the system, namely
  fcntl(2), ioctl(2), wait(2),
  waitid(2), and the following functions on slow devices
  like terminals: getmsg() and getpmsg() (see
  getmsg(2)); putmsg() and putpmsg() (see putmsg(2));
  pread(), read(), and readv() (see read(2)); pwrite(),
  write(), and writev() (see write(2)); recv(),
  recvfrom(), and recvmsg() (see recv(3SOCKET)); and
  send(), sendto(), and sendmsg() (see send(3SOCKET)).
  Otherwise, the function returns an EINTR error.

So in fact the behaviour seen is as documented on Solaris.
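
To make the consequence concrete, here is a minimal C sketch of a SIGCHLD handler installed with SA_RESTART, roughly what `$SIG{CHLD} = sub { wait() }` asks for. The names `reap_children`, `reaped_children` and `install_sigchld_reaper` are hypothetical. Going by the manpage text above, accept() is not in the restart list, so a server using such a handler on Solaris would presumably still have to cope with accept() returning EINTR.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical counter, only here so the effect can be observed. */
volatile sig_atomic_t reaped_children = 0;

/* Reap every child that has exited, without blocking. */
static void reap_children(int signo)
{
    int saved_errno = errno;   /* waitpid() may change errno */

    (void)signo;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        reaped_children++;
    errno = saved_errno;
}

/* Install the handler with SA_RESTART; calls that are not on the
 * system's restart list still fail with EINTR when it runs. */
int install_sigchld_reaper(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = reap_children;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);

    return sigaction(SIGCHLD, &sa, NULL);
}
```

The loop with WNOHANG is a deliberate difference from a single wait(): several children can exit before the handler runs once, and one wait() per signal could leave zombies behind.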

@p5pRT

p5pRT commented Aug 7, 2002

@gbarr - Status changed from 'open' to 'resolved'
