sysread in utf8 mode returns undefined result, but read input #8184

p5pRT · 2005-11-02T02:57:50Z

Migrated from rt.perl.org#37585 (status was 'open')

Searchable as RT37585$

p5pRT · 2005-11-02T02:57:51Z

From jutta@pobox.com

Created by jutta@pobox.com

This has been seen on perl 5.8.7 and 5.8.6.

I'm reading via a nonblocking binary TCP socket in "utf8" mode,
using sysread.

I'm reading a certain number of characters. If a server sends
me that exact number of _bytes_, but fewer than that many _characters_
-- in other words, if the server's data contains UTF-8 characters
and happens to just have exactly my buffer size --, the sysread
returns with an undefined character count and errno set to EAGAIN.

The buffer I pass in nevertheless contains the bytes read from
the socket - they're there, it's just that sysread's return value
denies their existence.

Instead of returning undefined, sysread()'s result should reflect the
number of characters appended to the passed-in buffer, and the EAGAIN
should not be returned until the client's next attempt to read.

Perl Info


Flags:
    category=core
    severity=medium

This perlbug was built using Perl v5.8.6 - Sat Mar 19 17:36:09 UTC 2005
It is being executed now by  Perl v5.8.6 - Sat Mar 19 17:31:24 UTC 2005.

Site configuration information for perl v5.8.6:

Configured by abuild at Sat Mar 19 17:31:24 UTC 2005.

Summary of my perl5 (revision 5 version 8 subversion 6) configuration:
  Platform:
    osname=linux, osvers=2.6.9, archname=x86_64-linux-thread-multi
    uname='linux bach 2.6.9 #1 smp fri jul 2 14:21:59 utc 2004 x86_64 x86_64 x86_64 gnulinux '
    config_args='-ds -e -Dprefix=/usr -Dvendorprefix=/usr -Dinstallusrbinperl -Dusethreads -Di_db -Di_dbm -Di_ndbm -Di_gdbm -Duseshrplib=true -Doptimize=-O2 -fmessage-length=0 -Wall -g -Wall -pipe'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=define use64bitall=define uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -fmessage-length=0 -Wall -g -Wall -pipe',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -pipe'
    ccversion='', gccversion='3.3.5 20050117 (prerelease) (SUSE Linux)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib64'
    libpth=/lib64 /usr/lib64 /usr/local/lib64
    libs=-lm -ldl -lcrypt -lpthread
    perllibs=-lm -ldl -lcrypt -lpthread
    libc=/lib64//lib64/libc.so.6, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.3.4'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.8.6/x86_64-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib64'

Locally applied patches:
    


@INC for perl v5.8.6:
    /usr/lib/perl5/5.8.6/x86_64-linux-thread-multi
    /usr/lib/perl5/5.8.6
    /usr/lib/perl5/site_perl/5.8.6/x86_64-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.6
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.6/x86_64-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.6
    /usr/lib/perl5/vendor_perl
    .


Environment for perl v5.8.6:
    HOME=/home/jutta
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/usr/X11R6/lib64:/usr/X11R6/lib
    LOGDIR (unset)
    PATH=/sbin:/usr/sbin:/home/jutta/shf:/sbin:/usr/sbin:/home/jutta/shf:/sbin:/usr/sbin:/home/jutta/shf:/sbin:/usr/sbin:/home/jutta/shf:/sbin:/usr/sbin:/home/jutta/shf:/sbin:/usr/sbin:/home/jutta/shf:/home/jutta/bin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/opt/kde3/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash


p5pRT · 2005-11-02T13:17:30Z

From @smpeters

Would it be possible for you to send us some sample code that
demonstrates the problem? It makes it difficult to replicate and test
the problem without it.

p5pRT · 2005-11-02T13:17:32Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2005-11-02T17:04:51Z

From jutta@panix.com

[Steve Peters via RT]

Would it be possible for you to send us some sample code that
demonstrates the problem? It makes it difficult to replicate and test
the problem without it.

Sure. Here's a little TCP server, written in C, and a perl
client. (The server is probably a two-line perl script, but I
don't actually speak perl well enough to be really sure what's
going on, so bear with me.)

This should compile under anything unixy -- save to foo.c, gcc -o
foo foo.c, ./foo 1234 to start on port 1234 (which is what the client
uses.)

================================= BEGIN =============================
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <fcntl.h>
#include <assert.h>

/* Create a TCP socket bound to interface:port and start listening.
 * Returns the socket, or -1 if bind() or listen() fails. */
static int
setup_listen_sock(char const * interface, short port) {

    int one = 1;
    int s;
    int rc;
    struct sockaddr_in server;

    s = socket(AF_INET, SOCK_STREAM, 0);
    assert(s != -1);

    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port = htons(port);
    if (interface)
        server.sin_addr.s_addr = inet_addr(interface);
    else server.sin_addr.s_addr = INADDR_ANY;

    setsockopt(s, SOL_SOCKET, SO_REUSEADDR, (void *)&one, sizeof(one));

    umask(0);
    rc = bind(s, (struct sockaddr *)&server, sizeof(server));
    if (rc == -1) {
        close(s);
        return -1;
    }

    if (listen(s, 100) == -1) {
        close(s);
        return -1;
    }
    return s;
}

int main(int ac, char **av)
{
    int listener;
    int conn;
    socklen_t len;
    struct sockaddr_in client;
    short port;
    char const * interface = NULL;

    if (ac < 2 || ac > 3) {
        fprintf(stderr, "usage: %s port [interface]\n", av[0]);
        exit(64);
    }

    port = atoi(av[1]);
    if (ac > 2)
        interface = av[2];

    listener = setup_listen_sock(interface, port);
    if (listener == -1) {
        printf("setup_listen_sock returned -1: %s\n", strerror(errno));
        exit (1);
    }

    memset(&client, 0, sizeof(client));

    while(1) {
        /* Twelve normal bytes, followed by twelve bytes that include
         * three bytes encoding U+2019, an old-fashioned apostrophe. */

        static char const greeting[] =
            "0123456789a>we\342\200\231re done";
        size_t greeting_n = sizeof(greeting) - 1;
        ssize_t cc;

        /* accept() expects a socklen_t, reset before every call. */
        len = sizeof(client);
        conn = accept(
            listener,
            (struct sockaddr *)&client,
            &len);
        assert(conn != -1);
        cc = write(conn, greeting, greeting_n);
        assert(cc == (ssize_t)greeting_n);

        /* Pretend that with the first part of the protocol completed,
         * the application now waits for a reaction from the other
         * side...
         */
    }
}
================================= END =============================

And here's the perl "client" that reads in blocks of 12 UTF-8 characters.

================================= BEGIN =============================
#!/usr/bin/perl -w

use strict;
use utf8;

# constants
my( $HOST ) = "localhost";
my( $PORT ) = "1234";
my( $BUFSIZE ) = "12";

# libraries
use IO::Socket;
use POSIX qw(:errno_h);

# open a socket
my( $socket ) =
    IO::Socket::INET->
        new( Proto    => "tcp",
             PeerAddr => $HOST,
             PeerPort => "gdb(" . $PORT . ")",
             Blocking => 0 );

# put the socket in UTF-8 mode.
binmode $socket, ":utf8";

my( $retval, $buffer );

# read from the socket.

while (1) {

    $retval = sysread($socket, $buffer, $BUFSIZE);

    if ( !defined($retval) && ( $! == EAGAIN ))
    {
        print "read failed: ", $!, "\r\n";
        print "(buffer is: '", $buffer, "')\r\n";
        last;
    }
    if ( $retval == 0 ) { last; }

    print "read: ", $retval, ", buffer: '", $buffer, "'\r\n";
}

exit 0;
================================= END =============================

The client's socket is set to non-blocking. So, if, as in
the second case, fewer than 12 UTF-8 characters are available
(namely, 10), I'd expect to simply be returned those ten -- with
the next read returning EAGAIN.

If you replace the UTF-8 apostrophe from the server data with a
simple us-ascii ', that's exactly what happens --

./test.pl
read: 12, buffer: '0123456789a>'
read: 10, buffer: 'we're done'
read failed: Resource temporarily unavailable
(buffer is: 'we're done')

And if you make the text just one byte shorter or longer (E.g. write
"we're ok." or "we're done." rather than "we're done"),
that's exactly what happens--

./test.pl
read: 12, buffer: '0123456789a>'
Wide character in print at ./test.pl line 45.
read: 9, buffer: 'we’re ok.'
read failed: Resource temporarily unavailable
Wide character in print at ./test.pl line 39.
(buffer is: 'we’re ok.')

and
./test.pl
read: 12, buffer: '0123456789a>'
Wide character in print at ./test.pl line 45.
read: 11, buffer: 'we’re done.'
read failed: Resource temporarily unavailable
Wide character in print at ./test.pl line 39.
(buffer is: 'we’re done.')

It's only if there's a UTF-8 character present and the _byte_ --
not character -- count hits the buffer size exactly that the error
result from the second "refill" UTF-8 read inside the sysread
implementation somehow overrides the partial results from
the first read:

./utf8read.pl
read: 12, buffer: '0123456789a>'
read failed: Resource temporarily unavailable
(buffer is: 'we\xe2\x80\x99re done')

Note that the characters did "make it" into the buffer, it's
just that the return value doesn't count them.
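
To make the byte/character arithmetic concrete, here is a minimal sketch
in C. The helper name utf8_char_count is made up for illustration; it is
not part of perl or of the test server above, and it assumes the input is
valid UTF-8. It counts characters by counting every byte that is not a
continuation byte.

```c
#include <stddef.h>

/* Count the UTF-8 encoded characters in buf[0..n).
 * Every byte that is not a continuation byte (bit pattern 10xxxxxx)
 * starts a new character. Assumes the input is valid UTF-8. */
static size_t utf8_char_count(const char *buf, size_t n)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Called on the second half of the greeting, "we\342\200\231re done" with a
length of 12 bytes, this yields 10: the twelve bytes fill the 12-unit
request exactly when counted as bytes, but leave it two short when counted
as characters, which is the case that triggers the extra "refill" read.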

Hope this reproduces on your system and helps explain what's
going on--let me know if you need more.

Cheers,

Jutta <jutta@pobox.com>

Leont · 2020-03-04T21:27:13Z

sysread and syswrite are no longer allowed on :utf8 handles, so this ticket can be closed.
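
With sysread unavailable on :utf8 handles, a reader that wants characters
from a non-blocking socket has to read raw bytes and decode them itself.
The awkward step is that a read can end in the middle of a multi-byte
character. A rough sketch of that step in C, the language of the test
server earlier in this ticket, follows. The function name
utf8_complete_prefix is hypothetical, and it only checks sequence lengths;
it does not validate the encoding.

```c
#include <stddef.h>

/* Return how many leading bytes of buf[0..n) form complete UTF-8
 * sequences. Bytes past that point belong to a character whose tail
 * has not arrived yet and should be kept for the next read.
 * Only sequence lengths are checked, not full validity. */
static size_t utf8_complete_prefix(const unsigned char *buf, size_t n)
{
    size_t start = n;   /* index of the last lead byte, once found */
    size_t need;

    /* Walk back over at most three continuation bytes (10xxxxxx). */
    while (start > 0 && n - start < 4) {
        start--;
        if ((buf[start] & 0xC0) != 0x80)
            break;
    }
    if (n == 0 || (buf[start] & 0xC0) == 0x80)
        return n;   /* empty, or no lead byte nearby: leave to the decoder */

    if (buf[start] < 0x80)                need = 1;
    else if ((buf[start] & 0xE0) == 0xC0) need = 2;
    else if ((buf[start] & 0xF0) == 0xE0) need = 3;
    else if ((buf[start] & 0xF8) == 0xF0) need = 4;
    else                                  need = 1;  /* invalid lead byte */

    return (n - start < need) ? start : n;
}
```

A caller would decode the first utf8_complete_prefix(buf, n) bytes, move
the remainder to the front of the buffer, and append the next read after
it, so a partial read never hides data that has already arrived.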

p5pRT added Severity Medium distro-Linux type-PerlIO type-Unicode type-core labels Oct 18, 2019

p5pRT added the khw label Oct 25, 2019

toddr removed the khw label Oct 25, 2019

Leont closed this as completed Mar 4, 2020
