[META] seek and tell operate on bytes #12250

Open
p5pRT opened this issue Jul 4, 2012 · 9 comments

p5pRT commented Jul 4, 2012

Migrated from rt.perl.org#113994 (status was 'open')

Searchable as RT113994$

p5pRT commented Jul 4, 2012

From @doy

Created by @doy

Collecting all of the reports related to (sys)?seek and tell operating
on bytes rather than characters.

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.16.0:

Configured by doy at Sat Jun  2 13:46:21 CDT 2012.

Summary of my perl5 (revision 5 version 16 subversion 0) configuration:
   
  Platform:
    osname=linux, osvers=3.3.5-1-arch, archname=x86_64-linux
    uname='linux xtahua 3.3.5-1-arch #1 smp preempt mon may 7 19:57:51 cest 2012 x86_64 gnulinux '
    config_args='-de -Dprefix=/home/doy/perl5/perlbrew/perls/perl-5.16.0'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.7.0 20120505 (prerelease)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib/../lib /usr/lib/../lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.15.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.15'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'

Locally applied patches:
    


@INC for perl 5.16.0:
    /home/doy/perl5/local/
    /home/doy/perl5/perlbrew/perls/perl-5.16.0/lib/site_perl/5.16.0/x86_64-linux
    /home/doy/perl5/perlbrew/perls/perl-5.16.0/lib/site_perl/5.16.0
    /home/doy/perl5/perlbrew/perls/perl-5.16.0/lib/5.16.0/x86_64-linux
    /home/doy/perl5/perlbrew/perls/perl-5.16.0/lib/5.16.0
    .


Environment for perl 5.16.0:
    HOME=/home/doy
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/doy/perl5/perlbrew/bin:/home/doy/perl5/perlbrew/perls/perl-5.16.0/bin:/home/doy/.bin/marathon:/home/doy/.bin/nethack:/home/doy/.bin/ghc:/home/doy/.bin:/usr/local/sbin:/usr/local/bin:/usr/lib/ccache/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/bin/vendor_perl:/usr/bin/core_perl
    PERL5LIB=/home/doy/perl5/local/
    PERLBREW_BASHRC_VERSION=0.33
    PERLBREW_HOME=/home/doy/.perlbrew
    PERLBREW_MANPATH=/home/doy/perl5/perlbrew/perls/perl-5.16.0/man
    PERLBREW_PATH=/home/doy/perl5/perlbrew/bin:/home/doy/perl5/perlbrew/perls/perl-5.16.0/bin
    PERLBREW_PERL=perl-5.16.0
    PERLBREW_ROOT=/home/doy/perl5/perlbrew
    PERLBREW_VERSION=0.43
    PERL_BADLANG (unset)
    PERL_CPANM_OPT=-q --mirror file:///home/doy/perl5/minicpan/ --mirror http://mirrors.kernel.org/cpan/ --mirror http://search.cpan.org/CPAN --prompt
    SHELL=/bin/zsh

p5pRT commented Jul 4, 2012

From iszczesniak@gmail.com

On Wed, Jul 4, 2012 at 11:17 PM, Jesse Luehrs <perlbug-followup@perl.org> wrote:

> # New Ticket Created by Jesse Luehrs
> # Please include the string: [perl #113994]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=113994 >
>
> This is a bug report for perl from doy@tozt.net,
> generated with the help of perlbug 1.39 running under perl 5.16.0.
>
> Collecting all of the reports related to (sys)?seek and tell operating
> on bytes rather than characters.

So how do you want to fix that? Seeking in characters instead of bytes
would require actually reading all characters between the old and new
position, since there are encodings like ja_JP.PCK (ShiftJIS) which do
not allow a graceful recovery if you seek into the middle of a character
and then look for the next valid multibyte character sequence. There's
also no way to guess how many characters lie between two file
positions, since a single character can occupy from 1 up to MB_CUR_MAX
(usually 5 or 6 depending on platform) bytes.
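
To make the byte/character mismatch concrete, here is a minimal Perl sketch. It assumes a temporary file created with File::Temp and an `:encoding(UTF-8)` layer; the sample string and variable names are illustrative only and do not come from any of the collected tickets. It reads one character at a time and records what tell() reports before each read.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write three characters that occupy 1, 2 and 3 bytes in UTF-8.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)') or die "binmode: $!";
print {$out} "a\x{e9}\x{20ac}";    # 'a', e-acute, euro sign
close($out) or die "close: $!";

open(my $in, '<:encoding(UTF-8)', $path) or die "open: $!";

my @offsets;
while (1) {
    my $pos = tell($in);                   # position in bytes, not characters
    my $got = read($in, my $char, 1);      # on a decoding handle this reads one character
    last unless $got;
    push @offsets, $pos;                   # byte offset where this character started
}
close($in) or die "close: $!";

print "character $_ starts at byte $offsets[$_]\n" for 0 .. $#offsets;
```

The character index and the byte offset drift apart as soon as a multibyte character has been read, so a character count cannot be derived from two byte positions without decoding what lies between them.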

Irek

p5pRT commented Jul 4, 2012

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Jul 4, 2012

From @doy

On Wed, Jul 04, 2012 at 02:27:30PM -0700, Irek Szczesniak via RT wrote:

> So how do you want to fix that? Seeking in characters instead of bytes
> will require to actually read all characters between the old and new
> position since there are encodings like ja_JP.PCK (ShiftJIS) which do
> not allow a graceful recover if you seek into the middle of it and
> then look for the next valid multibyte character sequence. There's
> also no way to guess how many characters are between two file
> positions since a single character can occupy from 1 up to MB_CUR_MAX
> (usually 5 or 6 depending on platform) bytes.

I don't personally have any ideas, there just seemed to be a lot of
tickets related to this issue, and it wasn't clear that they should all
be immediately closed as wontfix. If we end up making that decision,
this will at least make it easier to clear out the tracker.

-doy

p5pRT commented Jul 5, 2012

From @tonycoz

On Wed, Jul 04, 2012 at 04:29:48PM -0500, Jesse Luehrs wrote:

> I don't personally have any ideas, there just seemed to be a lot of
> tickets related to this issue, and it wasn't clear that they should all
> be immediately closed as wontfix. If we end up making that decision,
> this will at least make it easier to clear out the tracker.

I think they should be wontfix.

If I have a character oriented stream that I've sensibly cached
positions for (by calling tell()), I can currently efficiently and
reasonably call seek() to return to that character in the file, even
though I've stored a byte position.

If seek()/tell() is changed to work in characters, using the "seek to
the beginning and count characters" approach, then the code described
above will still work, but will become enormously inefficient.
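
The caching pattern described above might look like the following sketch. It is an illustration under assumptions, not code from any of the related tickets: the temporary file, the sample lines, and the names `@line_start` and `$third` are all made up for the example.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);
use File::Temp qw(tempfile);

# Build a small UTF-8 file whose lines contain multibyte characters.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)') or die "binmode: $!";
print {$out} "na\x{ef}ve\n", "caf\x{e9}\n", "\x{20ac}uro\n";
close($out) or die "close: $!";

open(my $fh, '<:encoding(UTF-8)', $path) or die "open: $!";

# First pass: remember where each line starts, as reported by tell().
my @line_start;
while (1) {
    my $pos = tell($fh);
    defined(my $line = <$fh>) or last;
    push @line_start, $pos;
}

# Later: jump straight back to the third line using the cached position.
seek($fh, $line_start[2], SEEK_SET) or die "seek: $!";
my $third = <$fh>;
chomp $third;
close($fh) or die "close: $!";

binmode(STDOUT, ':encoding(UTF-8)');
print "line starts (bytes): @line_start\n";
print "third line: $third\n";
```

The cached values are byte offsets, yet the seek lands exactly on a character boundary because each value came from tell() at such a boundary. No decoding of the skipped text is needed, which is the efficiency that a character-counting seek() would give up.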

As a reference, C only defines fseek() to random positions for binary
streams; on text streams the only specified behaviour is for fseek(f,
0, (any SEEK_*)) and fseek(f, some_value_from_ftell, SEEK_SET).

Tony

p5pRT commented Sep 6, 2013

From @rjbs

* Tony Cook <tony@develop-help.com> [2012-07-04T21:03:01]
> On Wed, Jul 04, 2012 at 04:29:48PM -0500, Jesse Luehrs wrote:
>
> > I don't personally have any ideas, there just seemed to be a lot of
> > tickets related to this issue, and it wasn't clear that they should all
> > be immediately closed as wontfix. If we end up making that decision,
> > this will at least make it easier to clear out the tracker.
>
> I think they should be wontfix.

I agree.

I have a filehandle with encoding(utf-8). It is a sequence of two-byte
sequences. I seek to the second byte of a pair and read. What should happen?

In UTF-8, there is no ambiguity. I have clearly tried to read mid-sequence.
It could die or it could return a replacement character, rewind, or skip ahead,
and warn while doing any of those. The layer could allow picking from these,
and I'm usually a fan of "when in doubt, die." But anyway, there's no
ambiguity, so we're good.
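
The reason UTF-8 is unambiguous here is that continuation bytes always have the bit pattern 10xxxxxx, so a layer can tell from a single byte whether it has landed mid-sequence. A minimal sketch of that check follows; `at_utf8_boundary` is a hypothetical helper written for this example, not an existing PerlIO or Encode function.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $offset falls on the first byte of a UTF-8
# sequence (or at the end of the data), false if it lands on a continuation byte.
sub at_utf8_boundary {
    my ($bytes, $offset) = @_;
    return 1 if $offset >= length $bytes;
    my $byte = ord substr($bytes, $offset, 1);
    return ($byte & 0xC0) != 0x80;    # continuation bytes look like 10xxxxxx
}

# Two two-byte sequences: e-acute (C3 A9) and u-umlaut (C3 BC).
my $bytes = "\xC3\xA9\xC3\xBC";

for my $offset (0 .. 3) {
    printf "offset %d: %s\n", $offset,
        at_utf8_boundary($bytes, $offset) ? 'start of a character' : 'mid-sequence';
}
```

What to do once a mid-sequence position is detected (die, warn, skip ahead, rewind) is the policy question raised above; the sketch only shows that detection itself is cheap.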

My understanding is that there exist encodings where we cannot so easily
determine our position (whether we are mid-sequence or not). I don't know of
any off hand, though. What about those? I really think the answer is going to
be along the lines of "the layer can try to help tell you that you messed up,
but don't *do* that."

--
rjbs

p5pRT commented Sep 7, 2013

From @Leont

On Fri, Sep 6, 2013 at 3:35 AM, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

> My understanding is that there exist encodings where we cannot so easily
> determine our position (whether we are mid-sequence or not). I don't know
> of any off hand, though. What about those? I really think the answer is
> going to be along the lines of "the layer can try to help tell you that you
> messed up, but don't *do* that.

iso-2022 is the most notorious one, though I'm not sure people use it
outside of MIME. Shift-JIS is another one. Though even UTF-16 can have the
issue if you're silly enough to do an odd seek.
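
The UTF-16 case can be sketched with the core Encode module. This is an illustration only; the two-letter sample string is an assumption chosen to keep the bytes easy to follow.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# 'AB' in UTF-16BE is the four bytes 00 41 00 42.
my $bytes = encode('UTF-16BE', 'AB');

# Decoding two bytes from an even offset gives back the original character.
my $aligned = decode('UTF-16BE', substr($bytes, 0, 2));

# Decoding two bytes from an odd offset (41 00) yields a different,
# perfectly valid code unit, so nothing signals that the seek was wrong.
my $shifted = decode('UTF-16BE', substr($bytes, 1, 2));

printf "aligned: U+%04X\n", ord $aligned;
printf "shifted: U+%04X\n", ord $shifted;
```

Unlike UTF-8, the misaligned read here decodes without any error, which is why an odd seek in UTF-16 cannot be detected from the bytes alone.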

This is all very "Doctor, it hurts when I press here" to me.

Leon

p5pRT commented Sep 7, 2013

From @Tux

On Sat, 7 Sep 2013 09:51:14 +0200, Leon Timmermans <fawaka@gmail.com> wrote:

> On Fri, Sep 6, 2013 at 3:35 AM, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>
> > My understanding is that there exist encodings where we cannot so easily
> > determine our position (whether we are mid-sequence or not). I don't know
> > of any off hand, though. What about those? I really think the answer is
> > going to be along the lines of "the layer can try to help tell you that you
> > messed up, but don't *do* that.
>
> iso-2022 is the most notorious one, though I'm not sure people use it
> outside of MIME. Shift-JIS is another one. Though even UTF-16 can have the
> issue if you're silly enough to do an odd seek.

iso-6937/2 is another. And I never found tuits to make that into Encoding.

> This is all very "Doctor, it hurts when I press here" to me.
>
> Leon

--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.19 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented Sep 7, 2013

From @rjbs

* Leon Timmermans <fawaka@gmail.com> [2013-09-07T03:51:14]
> On Fri, Sep 6, 2013 at 3:35 AM, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>
> > My understanding is that there exist encodings where we cannot so easily
> > determine our position (whether we are mid-sequence or not). I don't know
> > of any off hand, though. What about those? I really think the answer is
> > going to be along the lines of "the layer can try to help tell you that you
> > messed up, but don't *do* that.
>
> iso-2022 is the most notorious one, though I'm not sure people use it
> outside of MIME. Shift-JIS is another one. Though even UTF-16 can have the
> issue if you're silly enough to do an odd seek.
>
> This is all very "Doctor, it hurts when I press here" to me.

Yes, as I said: don't *do* that :-)

Oh, and with UTF-16, so I see! It hadn't occurred to me, but it's painfully
obvious, isn't it?

--
rjbs
