getc() and ungetc() with unicode failure #12691

Closed
p5pRT opened this issue Jan 7, 2013 · 6 comments

p5pRT commented Jan 7, 2013

Migrated from rt.perl.org#116322 (status was 'resolved')

Searchable as RT116322$

p5pRT commented Jan 7, 2013

From michael@negativespace.net

Created by michael@negativespace.net

This is a bug report for perl from michael@negativespace.net,
generated with the help of perlbug 1.39 running under perl 5.16.2.

getc() reads a multibyte Unicode character without problems, but
pushing a multibyte character back onto the stream with ungetc()
fails. From the test case below, it looks like ord() returns the
correct Unicode code point (0xC5), but ungetc() incorrectly treats it
as a single byte. After the push-back attempt the stream is broken:
the next getc() returns the raw byte together with the following
character ('\xC5b'), and the getc() after that returns undef.

I've also tested this with files on disk, instead of in-memory
scalars. The error is the same.
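
To make the byte-versus-code-point mismatch concrete, here is a minimal, standalone C sketch. It is not taken from perl's source; the function names utf8_encode and utf8_is_continuation are made up for illustration. It shows that U+00C5 is the two bytes 0xC3 0x85 in UTF-8, so pushing back the single byte 0xC5 leaves a lead byte followed by 'b', which is not a continuation byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8.
 * Returns the number of bytes written (1-4), or 0 if cp is a
 * surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp <= 0x7F) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not encodable */
    if (cp <= 0xFFFF) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: true if b has the form 10xxxxxx, i.e. it may
 * only appear after a UTF-8 lead byte. */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}
```

Calling utf8_encode(0xC5, buf) fills buf with 0xC3 0x85, while utf8_is_continuation('b') is false. That is consistent with the "does not map to utf8" warning in the output below.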

See also

http://stackoverflow.com/questions/14179751
http://perlmonks.org/?node_id=1011831

#!/usr/local/bin/perl

use 5.016;
use utf8;
use strict;
use warnings;

use Test::More;
my $builder = Test::More->builder;
binmode $builder->output, ":encoding(UTF-8)";
binmode $builder->failure_output, ":encoding(UTF-8)";
binmode $builder->todo_output, ":encoding(UTF-8)";

binmode(STDIN, ':encoding(utf-8)');
binmode(STDOUT, ':encoding(utf-8)');
binmode(STDERR, ':encoding(utf-8)');

my $string = qq[aÅb]; # use utf8 makes $string UTF-8.
my $fh = IO::File->new();
my $c;

$fh->open(\$string, '<:encoding(UTF-8)');

$c = $fh->getc();
is($c, 'a');
$c = $fh->getc();
is($c, 'Å'); # U+00C5
$fh->ungetc(ord("Å"));
$c = $fh->getc();
{
  local $TODO = "ungetc() doesn't unget it";
  is($c, 'Å'); # U+00C5
}
{
  local $TODO = "The stream is broken at this point. getc() returns undef.";
  $c = $fh->getc();
  is($c, 'b'); # Stream broken at this point.
}

done_testing();

__END__

ok 1
ok 2
not ok 3 # TODO ungetc() doesn't unget it
# Failed (TODO) test at unicodify.pl line 32.
"\x{00c5}" does not map to utf8 at /usr/local/lib/perl5/5.16.2/Test/Builder.pm line 1759.
# got: '\xC5b'
# expected: 'Å'
not ok 4 # TODO The stream is broken at this point. getc() returns undef.
# Failed (TODO) test at unicodify.pl line 37.
# got: undef
# expected: 'b'
1..4

__END__

$ perl -v

This is perl 5, version 16, subversion 2 (v5.16.2) built for darwin-2level

Perl Info

Flags:
    category=core
    severity=high

Site configuration information for perl 5.16.2:

Configured by michael at Wed Nov 21 10:46:56 PST 2012.

Summary of my perl5 (revision 5 version 16 subversion 2) configuration:
   
  Platform:
    osname=darwin, osvers=12.2.0, archname=darwin-2level
    uname='darwin bernard.local 12.2.0 darwin kernel version 12.2.0: sat aug 25 00:48:52 pdt 2012; root:xnu-2050.18.24~1release_x86_64 x86_64 '
    config_args='-de'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-common -DPERL_DARWIN -fno-strict-aliasing -pipe -fstack-protector -I/opt/local/include',
    optimize='-O3',
    cppflags='-fno-common -DPERL_DARWIN -fno-strict-aliasing -pipe -fstack-protector -I/opt/local/include'
    ccversion='', gccversion='4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags =' -fstack-protector -L/opt/local/lib'
    libpth=/opt/local/lib /usr/lib
    libs=-ldbm -ldl -lm -lutil -lc
    perllibs=-ldl -lm -lutil -lc
    libc=, so=dylib, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup -L/opt/local/lib -fstack-protector'

Locally applied patches:
    


@INC for perl 5.16.2:
    /usr/local/lib/perl5/site_perl/5.16.2/darwin-2level
    /usr/local/lib/perl5/site_perl/5.16.2
    /usr/local/lib/perl5/5.16.2/darwin-2level
    /usr/local/lib/perl5/5.16.2
    .


Environment for perl 5.16.2:
    DYLD_LIBRARY_PATH=:/usr/local/pgsql/lib
    HOME=/Users/michael
    LANG=en_CA
    LANGUAGE (unset)
    LC_ALL=C
    LD_LIBRARY_PATH=:/usr/local/pgsql/lib
    LOGDIR (unset)
    PATH=/usr/local/bin:/usr/local/mysql/bin:/usr/local/pgsql/bin:/Users/michael/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/git/bin:/usr/X11/bin:/opt/local/bin:/opt/local/sbin
    PERL_BADLANG (unset)
    SHELL=/bin/bash


p5pRT commented Jan 9, 2013

From @Leont

On Mon, Jan 7, 2013 at 6:40 AM, Michael Joyce <perlbug-followup@perl.org> wrote:

[quoted bug report and test script snipped]

It seems ungetc is *entirely* unicode-unaware [1]. Looks like this can
be fixed easily though.

Leon

[1] http://perl5.git.perl.org/perl.git/blob/HEAD:/dist/IO/IO.xs#l327

p5pRT commented Jan 9, 2013

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Jan 10, 2013

From chansen@cpan.org

On Wed Jan 09 13:37:10 2013, LeonT wrote:

It seems ungetc is *entirely* unicode-unaware [1]. Looks like this can
be fixed easily though.

Perhaps something like this:

#ifdef PerlIO
    UV v;

    /* Reject negative character numbers up front. */
    if ((SvIOK_notUV(c) && SvIV(c) < 0) || (SvNOK(c) && SvNV(c) < 0.0))
        croak("Negative character number in ungetc()");

    v = SvUV(c);
    /* ASCII, or a byte value on a non-UTF-8 handle: push back one byte. */
    if (v <= 0x7F || (v <= 0xFF && !PerlIO_isutf8(handle)))
        RETVAL = PerlIO_ungetc(handle, (int)v);
    else {
        U8 buf[UTF8_MAXBYTES + 1], *end;
        Size_t len;

        if (!PerlIO_isutf8(handle))
            croak("Wide character number in ungetc()");

        /* Encode the code point and push back the whole byte sequence. */
        end = uvchr_to_utf8_flags(buf, v, 0); /* XXX flags? */
        len = end - buf;
        if (PerlIO_unread(handle, buf, len) == len)
            XSRETURN_UV(v);
        else
            RETVAL = EOF;
    }
#else
    RETVAL = ungetc((int)SvIV(c), handle);
#endif

Returning -1 when ungetc() is unsuccessful feels wrong; wouldn't it be more perlish to return undef?
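
For readers who want the branch structure of the patch above without the XS macros, it can be summarised as a pure decision function. This is only a sketch: the enum and the name classify_unget are hypothetical and exist neither in IO.xs nor in PerlIO, and the is_utf8 argument stands in for the result of PerlIO_isutf8(handle).

```c
#include <stdint.h>

/* Hypothetical outcome labels for the three branches of the patch. */
typedef enum {
    UNGET_SINGLE_BYTE,   /* push the value back as one byte           */
    UNGET_UTF8_SEQUENCE, /* encode as UTF-8, push back every byte     */
    UNGET_WIDE_ERROR     /* value > 0xFF on a byte handle: croak      */
} unget_action;

/* Decide how a non-negative character number should be pushed back.
 * is_utf8 is nonzero when the handle carries a UTF-8 layer. */
unget_action classify_unget(uint32_t v, int is_utf8)
{
    /* ASCII is one byte in either mode; 0x80-0xFF is one byte only
     * when the handle is not UTF-8. */
    if (v <= 0x7F || (v <= 0xFF && !is_utf8))
        return UNGET_SINGLE_BYTE;

    /* Anything else cannot be represented on a byte handle. */
    if (!is_utf8)
        return UNGET_WIDE_ERROR;

    return UNGET_UTF8_SEQUENCE;
}
```

With this reading, the case from the report (0xC5 on a handle opened with :encoding(UTF-8)) falls into UNGET_UTF8_SEQUENCE, whereas the same value on a plain byte handle is still pushed back as a single byte.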

--
chansen

p5pRT commented Feb 19, 2013

From @khwilliamson

Fixed by commit 10e621b
--
Karl Williamson

p5pRT commented Feb 19, 2013

@khwilliamson - Status changed from 'open' to 'resolved'
