Activestate Perl, text files and tell() for Unix line endings #4131

Closed
p5pRT opened this issue Jun 25, 2001 · 5 comments

Comments

@p5pRT

p5pRT commented Jun 25, 2001

Migrated from rt.perl.org#7177 (status was 'open')

Searchable as RT7177$

@p5pRT
Author

p5pRT commented Jun 25, 2001

From Martin_Hosken@sil.org

Created by martin_hosken@sil.org

Perl for Win32 handles Unix style line endings of text files very well. But, if you do a tell() on a file with
Unix line endings, the tell() will be out by the number of lines read so far (i.e. it is adding 2 for each line
ending, even though there is only one byte there). The result is that tell() is giving the wrong value.

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl v5.6.1:

Configured by Administrator at Wed May  2 01:31:01 2001.

Summary of my perl5 (revision 5 version 6 subversion 1) configuration:
  Platform:
    osname=MSWin32, osvers=4.0, archname=MSWin32-x86-multi-thread
    uname=''
    config_args='undef'
    hint=recommended, useposix=true, d_sigaction=undef
    usethreads=undef use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=undef d_sfio=undef uselargefiles=undef usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
  Compiler:
    cc='cl', ccflags ='-nologo -O1 -MD -DNDEBUG -DWIN32 -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT  -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS
-DPERL_MSVCRT_READFIX',
    optimize='-O1 -MD -DNDEBUG',
    cppflags='-DWIN32'
    ccversion='', gccversion='', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=10
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=4
    alignbytes=8, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='', ldflags ='-nologo -nodefaultlib -release  -libpath:"D:\Progs\Perl\lib\CORE"  -machine:x86'
    libpth="D:\Progs\Perl\lib\CORE"
    libs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib  comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib  netapi32.lib
uuid.lib wsock32.lib mpr.lib winmm.lib  version.lib odbc32.lib odbccp32.lib msvcrt.lib
    perllibs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib  comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib  netapi32.lib
 uuid.lib wsock32.lib mpr.lib winmm.lib  version.lib odbc32.lib odbccp32.lib msvcrt.lib
    libc=msvcrt.lib, so=dll, useshrplib=yes, libperl=perl56.lib
  Dynamic Linking:
    dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -release  -libpath:"D:\Progs\Perl\lib\CORE"  -machine:x86'

Locally applied patches:
    ACTIVEPERL_LOCAL_PATCHES_ENTRY


@INC for perl v5.6.1:
    d:\src\perllib
    D:/Progs/Perl/lib
    D:/Progs/Perl/site/lib
    .


Environment for perl v5.6.1:
    HOME (unset)
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=d:\src\bat;D:\Progs\Perl\bin\;c:\dosutils;C:\WINNT\system32;C:\WINNT;C:\WINNT\System32\Wbem
    PERLLIB=d:\src\perllib
    PERL_BADLANG (unset)
    SHELL (unset)

@p5pRT
Author

p5pRT commented Jun 25, 2001

From [Unknown Contact. See original ticket]

> Perl for Win32 handles Unix style line endings of text files very
> well. But, if you do a tell() on a file with Unix line endings, the
> tell() will be out by the number of lines read so far (i.e. it is
> adding 2 for each line ending, even though there is only one byte
> there). The result is that tell() is giving the wrong value.

Does this also happen when you use binmode() on the file?

I imagine that what's happening is something like this: You open a
file and the RTL does this in text mode by default. Thus, it expects to
get "\r\n" sequences which it will translate into "\n". Then something
looks at the characters read after this preprocessing and sees an
"\n" and has no way of knowing whether this "\n" was really a "\r\n"
in its former life on disk or just a plain "\n" and assumes the former,
as that will usually be the case for text files on DOS-like systems.

I can imagine that turning on binmode() on files known not to be DOS
text files would disable this translation of character positions. It's
worth a try, anyway.

Cheers,
Philip
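
Philip's binmode() suggestion can be tried with a sketch along these lines. It is only an illustration: the sample content and temporary file are made up, and it simply compares tell() against a running byte count once CRLF translation is switched off.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Sample file with Unix line endings, written in binary mode so that no
# translation happens on any platform. The content is made up.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "one\ntwo\nthree\n";
close $out or die "close: $!\n";

open my $fh, '<', $path or die "open: $!\n";
binmode $fh;    # switch off CRLF translation for this handle

my $expected = 0;    # bytes consumed so far
my @mismatch;        # line numbers where tell() disagrees
while (my $line = <$fh>) {
    $expected += length $line;
    my $told = tell $fh;
    push @mismatch, $. if $told != $expected;
    print "line $.: tell=$told expected=$expected\n";
}
close $fh;

print @mismatch
    ? "tell() disagrees on lines @mismatch\n"
    : "tell() matches byte offsets\n";
```

With binmode() in effect, a CRLF file would hand back lines that still end in "\r\n", so the caller has to strip the "\r" itself.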

@p5pRT
Author

p5pRT commented Jun 25, 2001

From [Unknown Contact. See original ticket]

> I imagine that what's happening is something like this: You open a
> file and the RTL does this in text mode by default. Thus, it expects to
> get "\r\n" sequences which it will translate into "\n". Then something
> looks at the characters read after this preprocessing and sees an
> "\n" and has no way of knowing whether this "\n" was really a "\r\n"
> in its former life on disk or just a plain "\n" and assumes the former,
> as that will usually be the case for text files on DOS-like systems.
>
> I can imagine that turning on binmode() on files known not to be DOS
> text files would disable this translation of character positions. It's
> worth a try, anyway.

OK. Here is my take on this.

1. There used to be a bug in ActivePerl akin to this, in that a normal DOS
line ended file opened as a text file would cause tell() to lie. This was
fixed (thankfully) somewhere around 509 (complete guess). So this problem
is merely an extension of that problem.
2. I can't see anything in the code that does any special handling for text
files when doing tell(), but then I am a core code neophyte. So, maybe it
is an OS related bug. Mind you I was under the impression that Perl was
reimplementing all of stdio.
3. While binmode()ing the file may fix the problem, I didn't expect this
Unix line ended file to end up on my system (it was created through a bug
in Winzip - not extracting a text file from a .tar.gz into DOS line endings
like it normally does and did for all the other text files!), so I couldn't
really plan ahead. Either Perl should reject the file or handle tell()
correctly (?). Hence my bug report.
4. On reflection, how would I find out whether I had to binmode the file
without opening the thing twice (once in binmode to find out if it is a
normal DOS file and then again in text mode since you can't unbinmode a
file, especially after you have read from it!)?
5. How about if the Win32 Perl automagically binmoded a file with Unix line
endings. Would that be dwimish enough without repercussions? But then Win32
Perl would have to open the file twice or something, I guess.

Thanks for looking into this.

Martin
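
For point 4, one possible approach on a Perl built with PerlIO layers (5.8 or later; the 5.6.1 build reported above has useperlio=undef, so this would not apply there) is to open the file once, sniff it in binary mode, rewind, and only then add the `:crlf` layer. The helper name `open_sniffed` and the 4096-byte sniff size are assumptions made for this sketch, not anything from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): open the file once, look at
# the first chunk as raw bytes, rewind, then decide on CRLF translation.
sub open_sniffed {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    binmode $fh;                                  # raw bytes while sniffing
    my $chunk = '';
    defined read($fh, $chunk, 4096) or die "Cannot read $path: $!\n";
    seek $fh, 0, 0 or die "Cannot rewind $path: $!\n";
    my $style = $chunk =~ /\r\n/ ? 'crlf' : 'lf';
    binmode $fh, ':crlf' if $style eq 'crlf';     # translate only if needed
    return ($fh, $style);
}

# Usage: perl sniff.pl somefile.txt
my $path = shift @ARGV;
if (defined $path) {
    my ($fh, $style) = open_sniffed($path);
    while (my $line = <$fh>) {
        chomp $line;
        print "$style $.: $line : ", tell($fh), "\n";
    }
    close $fh;
}
```

The sniff is deliberately crude: it only inspects the first chunk, so a file with mixed line endings, or one whose first CRLF pair straddles the chunk boundary, would be misclassified.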

@p5pRT
Author

p5pRT commented Jun 26, 2001

From [Unknown Contact. See original ticket]

Also notice that the described behaviour is perfectly valid. The result of
ftell on a text stream is really no more than a cookie that you can give
back to fseek. The idea that it is a byte offset is something people are
used to, but the C standard only guarantees that for binary streams (and I
expect this will become even clearer if we ever get things like files with
UTF-8 encoded content).

E.g. when I used C on the IBM 370, they had something called "variable format"
files, where each "line" is a variable record that can be independently
shrunk or extended to size 65535. The result of a tell on that system was
record_number*65536+record_offset
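
As a rough illustration of the cookie idea, the sketch below stores each tell() result and later hands one back to seek() unchanged, without doing any arithmetic on it. The second part spells out the record-style encoding described above; `encode_cookie` and `decode_cookie` are hypothetical names for this example and the sample content is made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small LF-terminated sample file (the content is made up).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "alpha\nbeta\ngamma\n";
close $out or die "close: $!\n";

# Remember the cookie tell() returns just before each line is read.
open my $in, '<', $path or die "open: $!\n";
my @cookie;
while (1) {
    my $pos = tell $in;
    defined(my $line = <$in>) or last;
    push @cookie, $pos;
}

# Give a cookie back to seek() as-is to revisit the second line.
seek $in, $cookie[1], 0 or die "seek: $!\n";
my $second = <$in>;
chomp $second;
print "second line: $second\n";
close $in;

# Record-style cookie: record_number*65536 + record_offset.
sub encode_cookie { my ($rec, $off) = @_; return $rec * 65536 + $off }
sub decode_cookie { my ($c) = @_; return (int($c / 65536), $c % 65536) }

my $c = encode_cookie(3, 17);
my ($rec, $off) = decode_cookie($c);
print "cookie $c -> record $rec, offset $off\n";
```

A cookie built the second way is clearly not a byte offset, which is why code that only passes tell() results back to seek() keeps working in both schemes.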

@richardleach
Contributor

Although the above comment is worth keeping in mind, I tried to recreate this on Strawberry Perl v5.30.0 on Win 10.

Using the following saved as a text file:

1
2
3
4
5
6
7
8
9
10

with this script:

use v5.30;

my $line = 1;
open my $FILE,"<",$ARGV[0] or die "$!\n";
while (my $blurb = <$FILE>) {
  chomp $blurb;
  say "$line: $blurb : ".tell($FILE);
  $line++;
}
close $FILE;

With CRLF endings, the output was:

1: 1 : 3
2: 2 : 6
3: 3 : 9
4: 4 : 12
5: 5 : 15
6: 6 : 18
7: 7 : 21
8: 8 : 24
9: 9 : 27
10: 10 : 31

With LF endings, the output was:

1: 1 : 2
2: 2 : 4
3: 3 : 6
4: 4 : 8
5: 5 : 10
6: 6 : 12
7: 7 : 14
8: 8 : 16
9: 9 : 18
10: 10 : 21

That looks like the expected output. Seems like the reported problem has been fixed sometime down the years - or else is a bug for ActiveState not P5P - and this ticket can be closed.
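
A self-contained variant of the check above, for anyone who wants to repeat it without preparing the two files by hand. This is only a sketch: `tell_after_each_line` and `byte_offsets` are hypothetical helper names, and the script just compares tell() in the default open mode against offsets computed from the raw content.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $content verbatim, read it back in the
# platform's default mode, and collect tell() after each line.
sub tell_after_each_line {
    my ($content) = @_;
    my ($out, $path) = tempfile(UNLINK => 1);
    binmode $out;
    print {$out} $content;
    close $out or die "close: $!\n";

    open my $in, '<', $path or die "open: $!\n";
    my @told;
    while (defined(my $line = <$in>)) {
        push @told, tell $in;
    }
    close $in;
    return @told;
}

# Byte offset just past each "\n" in the raw content.
sub byte_offsets {
    my ($content) = @_;
    my @offsets;
    while ($content =~ /\n/g) {
        push @offsets, pos($content);
    }
    return @offsets;
}

for my $eol ("\r\n", "\n") {
    my $content = join '', map { "$_$eol" } 1 .. 10;
    my @told    = tell_after_each_line($content);
    my @want    = byte_offsets($content);
    my $label   = $eol eq "\n" ? 'LF' : 'CRLF';
    my $verdict = "@told" eq "@want"
        ? "tell() matches byte offsets"
        : "tell() differs: got @told, wanted @want";
    print "$label: $verdict\n";
}
```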

@xsawyerx added the "Closable?" label (We might be able to close this ticket, but we need to check with the reporter) on Dec 29, 2019
@toddr closed this as completed on Feb 13, 2020
@toddr removed the "Closable?" label on Feb 13, 2020