Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in read() #7491

Closed
p5pRT opened this issue Sep 5, 2004 · 6 comments
Closed

Bug in read() #7491

p5pRT opened this issue Sep 5, 2004 · 6 comments

Comments

@p5pRT
Copy link

p5pRT commented Sep 5, 2004

Migrated from rt.perl.org#31459 (status was 'resolved')

Searchable as RT31459$

@p5pRT
Copy link
Author

p5pRT commented Sep 5, 2004

From chris@heathens.co.nz

Hi,

I just found a rather obscure bug in read(). It is scanning over
potentially unallocated memory, which I'm sure could make Perl crash if
you tried hard enough.

Here's my test case​:

echo A | perl -e '$c=chr(256).chr(32).chr(255);$c=chr(256);
read(STDIN,$c,1,3); $c=~s/(.)/sprintf("%02X ",ord($1))/eg;print $c,"\n"'

Output is​:
C4 80 00 00 00 41

Output should be​:
C4 80 00 00 41

It only happens when $c (the second parameter to read()) is a UTF8
string and the 4th parameter is an offset greater than length($c).

My Perl version is​:
This is perl, v5.8.3 built for i386-linux-thread-multi

but from looking at the source code, it looks like the bug is still
there in 5.8.5.

Chris

@p5pRT
Copy link
Author

p5pRT commented Sep 6, 2004

From chris@heathens.co.nz

Here is a patch for this bug.

HTH,
Chris

Inline Patch
--- pp_sys.c.orig	2004-09-05 22:54:59.000000000 -0400
+++ pp_sys.c	2004-09-05 23:00:37.000000000 -0400
@@ -1656,7 +1656,10 @@
     }
     if (DO_UTF8(bufsv)) {
 	/* convert offset-as-chars to offset-as-bytes */
-	offset = utf8_hop((U8 *)buffer,offset) - (U8 *) buffer;
+	if (offset >= (int)blen)
+	    offset += SvCUR(bufsv) - blen;
+	else
+	    offset = utf8_hop((U8 *)buffer,offset) - (U8 *) buffer;
     }
  more_bytes:
     bufsize = SvCUR(bufsv);

@p5pRT
Copy link
Author

p5pRT commented Sep 14, 2004

From @iabyn

On Mon, Sep 06, 2004 at 12​:03​:12AM -0400, Chris Heath wrote​:

I just found a rather obscure bug in read(). It is scanning over
potentially unallocated memory, which I'm sure could make Perl crash if
you tried hard enough.

Here's my test case​:

echo A | perl -e '$c=chr(256).chr(32).chr(255);$c=chr(256);
read(STDIN,$c,1,3); $c=~s/(.)/sprintf("%02X ",ord($1))/eg;print $c,"\n"'

Output is​:
C4 80 00 00 00 41

Output should be​:
C4 80 00 00 41

It only happens when $c (the second parameter to read()) is a UTF8
string and the 4th parameter is an offset greater than length($c).

My Perl version is​:
This is perl, v5.8.3 built for i386-linux-thread-multi

but from looking at the source code, it looks like the bug is still
there in 5.8.5.

Well-spotted!

Here is a patch for this bug.

Thanks, applied to the development branch of Perl as change #23321

P5Pers​: while looking at this bug, I noticed that a read from a
non-utf8 filehandle into a utf8 buffer variable forcibly downgrades
the variable from utf8, eg

$ echo 'B'|./perl -Ilib -MDevel​::Peek -e'$c="\x{100}A";Dump($c);read STDIN,$c,1,1;Dump($c)'
SV = PV(0x81af058) at 0x81c6760
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x81b2b08 "\304\200A"\0 [UTF8 "\x{100}A"]
  CUR = 3
  LEN = 4
SV = PVMG(0x81e3db8) at 0x81c6760
  REFCNT = 1
  FLAGS = (SMG,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x81b2b08 "\304\200B"\0
  CUR = 3
  LEN = 4
  MAGIC = 0x81c0970
  MG_VIRTUAL = &PL_vtbl_utf8
  MG_TYPE = PERL_MAGIC_utf8(w)
  MG_LEN = -1

Ideally I'd expect the resulting string to be "\x{100}B" rather than
"\304\200B"; or at least for a warning to be issued.

--
My get-up-and-go just got up and went.

@p5pRT
Copy link
Author

p5pRT commented Sep 14, 2004

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Sep 19, 2004

From chris@heathens.co.nz

On Tue, 2004-09-14 at 18​:18, Dave Mitchell via RT wrote​:

P5Pers​: while looking at this bug, I noticed that a read from a
non-utf8 filehandle into a utf8 buffer variable forcibly downgrades
the variable from utf8, eg

$ echo 'B'|./perl -Ilib -MDevel​::Peek -e'$c="\x{100}A";Dump($c);read STDIN,$c,1,1;Dump($c)'
SV = PV(0x81af058) at 0x81c6760
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x81b2b08 "\304\200A"\0 [UTF8 "\x{100}A"]
CUR = 3
LEN = 4
SV = PVMG(0x81e3db8) at 0x81c6760
REFCNT = 1
FLAGS = (SMG,POK,pPOK)
IV = 0
NV = 0
PV = 0x81b2b08 "\304\200B"\0
CUR = 3
LEN = 4
MAGIC = 0x81c0970
MG_VIRTUAL = &PL_vtbl_utf8
MG_TYPE = PERL_MAGIC_utf8(w)
MG_LEN = -1

Ideally I'd expect the resulting string to be "\x{100}B" rather than
"\304\200B"; or at least for a warning to be issued.

You have a good point. I would be happy to take a stab at making a
patch, but perhaps we first need to clarify exactly what the behavior
should be. It gets a little tricky when you add "use encoding" into the
equation.

Example 1​:
echo $'\xFF' | perl -MDevel​::Peek -e'use encoding "iso-8859-2", STDIN =>
undef; my $c="\x{100}";Dump($c);read STDIN,$c,1,1;Dump($c)'

SV = PV(0x811f470) at 0x8129cf8
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x81264f8 "\304\200"\0 [UTF8 "\x{100}"]
  CUR = 2
  LEN = 3
SV = PVMG(0x81bc900) at 0x8129cf8
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x81264f8 "\304\200\377"\0
  CUR = 3
  LEN = 4
  MAGIC = 0x81aa228
  MG_VIRTUAL = &PL_vtbl_utf8
  MG_TYPE = PERL_MAGIC_utf8(w)
  MG_LEN = -1

We are saying that this example should probably return a utf8 string.
Should it return "\x{100}\x{2d9}"? (ISO-8859-2 maps \xFF to \x2D9.) It
seems a little weird, since we are applying the script's encoding to the
bytes read from STDIN even though we explicitly excluded STDIN from such
conversions. But it makes sense, because the . operator does the same
thing when it concatenates one utf8 and one non-utf8 string.

Example 2​:
echo $'\xFF' | perl -MDevel​::Peek -e'use encoding "iso-8859-2", STDIN =>
undef; my $c="";Dump($c);read STDIN,$c,1;Dump($c)'

SV = PV(0x811f470) at 0x8129cf0
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x81264f0 ""\0 [UTF8 ""]
  CUR = 0
  LEN = 1
SV = PVMG(0x81bc8f8) at 0x8129cf0
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x81264f0 "\377"\0
  CUR = 1
  LEN = 2
  MAGIC = 0x81aa228
  MG_VIRTUAL = &PL_vtbl_utf8
  MG_TYPE = PERL_MAGIC_utf8(w)
  MG_LEN = -1

Notice that $c is a utf8 string before read() is called, so the
difference between this example and example 1 is very small. However,
we could break too much existing code if we make Example 2 return a utf8
string.

Therefore, my suggestion would be that we only return a utf8 string if
offset is non-zero and if length($c) is non-zero, i.e. if the returned
string contains at least one character from the original value of $c.

Sound good?

Chris

@p5pRT
Copy link
Author

p5pRT commented Nov 14, 2008

@smpeters - Status changed from 'open' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant