tell / getline problems on Win32 with unix-delimited files opened with encoding(UTF-8) #13471

p5pRT · 2013-12-16T01:36:43Z

Migrated from rt.perl.org#120797 (status was 'open')

Searchable as RT120797$

p5pRT · 2013-12-16T01:36:43Z

From cm.perl@abtela.com

This is a bug report for perl from cm.perl@abtela.com,
generated with the help of perlbug 1.39 running under perl 5.18.1.

When you open a unix-delimited file (i.e., lines end in LF, not CRLF)
on Win32 with
my $io = new IO::File($filename, "<:encoding(...)")
a call to
tell $io;
seem to corrupt the handle / layers state to the point that the next
call to $io->getline does not return the next line as expected.

This is a serious problem as it precludes any use of
$io->input_line_number
(which makes a call to tell) for unix-delimited files opened this
way on Win32.

This seems to be the reason why Pod-Eventual-0.094001 fails tests
on Win32. It calls input_line_number on handles opened by
Mixin-Linewise-0.102 with a default encoding of ":encoding(UTF-8)"
(the introduction of this default encoding was apparently the
rationale for the latest versions of these two modules).

Dist-Zilla, Pod-Weaver, Config-INI and other important CPAN distribs
depend on these.

The attached test file io_tell_encoding.t illustrates the problem
(it has been coded in the style of dist/IO/t/io_linenum.t in the
hope that it would facilitate its integration... If Test::More can
be used I will be happy to provide a more explicit version).

What this test program does is first establish a "reference" version
of the list of lines to be read (using a 'traditional' open) and
then reads again the same file using various "extensions" :

my $io = IO::File->new($File, "<:$encoding") or die $!;

for the following values of $encoding :

"encoding(UTF-8)", "encoding(iso-8859-1)", "", "raw", "crlf", "utf8"

any of which should be able to read without problem a pure ASCII,
unix-delimited file. Each line read (with $io->getline) is compared
with the reference.

In a first batch of tests there is a call to tell($io) after each
$io->getline.
This call is omitted in a second batch.

The test file is the test program itself (the comments at the end of
the program text were crafted to make it easier to see the problem).

When stored as a unix (LF delimited) file, this program yields

Taisha:~/devbin/tmp $ perl io_tell_encoding.t
1..12
# Running under perl version 5.018001 for MSWin32
# Current time local: Mon Dec 16 01:45:11 2013
# Current time GMT: Mon Dec 16 00:45:11 2013
# Using Test.pm version 1.26
not ok 1
# Test 1 got: "line 1, expected 'my $File;\n', got '5a6a7a8a9\n'"
(io_tell_encoding.t at line 40)
# Expected: "OK" (encoding = encoding(UTF-8), tell = 1)
# io_tell_encoding.t line 40 is: ok(test($encoding, $tell), "OK",
"encoding = $encoding, tell = $tell");
not ok 2
# Test 2 got: "line 1, expected 'my $File;\n', got '5a6a7a8a9\n'"
(io_tell_encoding.t at line 40 fail #2)
# Expected: "OK" (encoding = encoding(iso-8859-1), tell = 1)
ok 3
ok 4
ok 5
ok 6
ok 7
ok 8
ok 9
ok 10
ok 11
ok 12
Taisha:~/devbin/tmp $

We see that the test fails only for "encoding(...)" when tell($io)
is called.

If the program is stored as a CRLF delimited file it yields instead

Taisha:~/devbin/tmp $ perl io_tell_encoding.t
1..12
# Running under perl version 5.018001 for MSWin32
# Current time local: Mon Dec 16 01:47:14 2013
# Current time GMT: Mon Dec 16 00:47:14 2013
# Using Test.pm version 1.26
ok 1
ok 2
ok 3
not ok 4
# Test 4 got: "line 0, expected '#!./perl\n', got '#!./perl\r\n'"
(io_tell_encoding.t at line 40 fail #4)
# Expected: "OK" (encoding = raw, tell = 1)
# io_tell_encoding.t line 40 is: ok(test($encoding, $tell), "OK",
"encoding = $encoding, tell = $tell");
ok 5
ok 6
ok 7
ok 8
ok 9
not ok 10
# Test 10 got: "line 0, expected '#!./perl\n', got '#!./perl\r\n'"
(io_tell_encoding.t at line 40 fail #10)
# Expected: "OK" (encoding = raw, tell = 0)
ok 11
ok 12
Taisha:~/devbin/tmp $

now the only encoding that fails is ':raw', which is normal and
unrelated to this ticket.

I have tried to investigate further but after a few hours concluded
that this problem was way over my head :(

Thank you for your time and attention.

Flags:
category=core
severity=critical

Site configuration information for perl 5.18.1:

Configured by strawberry-perl at Tue Aug 13 19:21:46 2013.

Summary of my perl5 (revision 5 version 18 subversion 1) configuration:

Platform:
osname=MSWin32, osvers=4.0, archname=MSWin32-x86-multi-thread-64int
uname='Win32 strawberry-perl 5.18.1.1 #1 Tue Aug 13 19:20:13 2013
i386'
config_args='undef'
hint=recommended, useposix=true, d_sigaction=undef
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=define, use64bitall=undef, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='gcc', ccflags =' -s -O2 -DWIN32 -DPERL_TEXTMODE_SCRIPTS
-DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO
-fno-strict-aliasing -mms-bitfields',
optimize='-s -O2',
cppflags='-DWIN32'
ccversion='', gccversion='4.7.3', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long long', ivsize=8, nvtype='double', nvsize=8,
Off_t='long long', lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='g++.exe', ldflags ='-s
-L"E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\perl\lib\CORE"
-L"E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\c\lib"'
libpth=E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\c\lib
E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\c\i686-w64-mingw32\lib
libs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32
-ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr
-lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
perllibs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool
-lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid
-lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
libc=, so=dll, useshrplib=true, libperl=libperl518.a
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
cccdlflags=' ', lddlflags='-mdll -s
-L"E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\perl\lib\CORE"
-L"E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\c\lib"'

Locally applied patches:

@INC for perl 5.18.1:
E:/cm/devbin/strawberry-perl-5.18.1.1-32bit-portable/perl/site/lib
E:/cm/devbin/strawberry-perl-5.18.1.1-32bit-portable/perl/vendor/lib
E:/cm/devbin/strawberry-perl-5.18.1.1-32bit-portable/perl/lib
.

Environment for perl 5.18.1:
CYGWIN=nodosfilewarning
HOME=e:/cm
LANG (unset)
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)

PATH=E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\perl\site\bin;E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\perl\bin;E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\c\bin;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program
Files (x86)\PuTTY;C:\Program Files (x86)\OpenOffice.org
3\program;C:\Program Files (x86)\QT Lite\QTSystem;c:\Program
Files\WinRAR;C:\Program Files (x86)\Calibre2\
PERL_BADLANG (unset)
SHELL (unset)

p5pRT · 2013-12-16T01:36:43Z

From cm.perl@abtela.com

#!./perl
my $File;
my @lines;
my @encodings;

BEGIN {
$File = __FILE__;
require strict; import strict;
@lines = do {
open(my $f, "<", $File) or die $!;
<$f>;
};

@encodings = ("encoding(UTF-8)", "encoding(iso-8859-1)", "", "raw", "crlf", "utf8");
}

use Test;

BEGIN { plan tests => 2*@encodings }

use IO::File;

sub test {
my ($encoding, $tell) = @_;
my $io = IO::File->new($File, "<:$encoding") or die $!;
my $cnt = 0;
while (defined (my $line = $io->getline)) {
$line eq $lines[$cnt]
or return "line $cnt, expected '$lines[$cnt]', got '$line'";
if ($tell) {
tell $io;
}
++$cnt;
}
return "OK";
}

for my $tell (1, 0) {
for my $encoding (@encodings) {
ok(test($encoding, $tell), "OK", "encoding = $encoding, tell = $tell");
}
}
#a0a1a2a3a4a5a6a7a8a9
#b0b1b2b3b4b5b6b7b8b9

p5pRT · 2013-12-17T20:33:33Z

From cm.perl@abtela.com

Le 16/12/2013 02:36, Christian Millour (via RT) a écrit :

When you open a unix-delimited file (i.e., lines end in LF, not CRLF)
on Win32 with
my $io = new IO::File($filename, "<:encoding(...)")
a call to
tell $io;
seem to corrupt the handle / layers state to the point that the next
call to $io->getline does not return the next line as expected.

Note that the problem is not limited to Win32. You get the same
misbehavior on unix when using "<:crlf:encoding(whatever)" on a
LF-delimited file.

What is really needed here is a "permissive" :crlf layer, i.e. one
that will allow reading either LF- or CRLF-delimited files. This is
more or less already the case in practice, unless you use an encoding
layer with :crlf. In that latter case, PerlIOEncode_flush() calls
PerlIO_unread(), which resolves to PerlIOCrlf_unread(), which
currently always translate back '\n' as a CR LF pair irrespective of
the original content, potentially thrashing the buffer in the process.

The attached tentative patch implements a form of autodetection of the
delimiter actually used in the stream. It involves a new PerlIO flag,
currently named (this is negotiable :)) PERLIO_F_CRLFSAWLF. This flag
is set by PerlIOCrlf_get_cnt() on finding a LF (actually a
NATIVE_0xd). PerlIOCrlf_unread() then does its specific work only if
the flag got set, and otherwise does a regulard PerlIOBuf_unread().
All bets are off though if the file being read uses both LF and CRLF.

This patch seems to work with blead on linux and Win32, at least as a
proof of concept. Dedicating an PerlIO flag for this might look as a
stiff price to pay but it keeps things simple (I thought for a time
that playing with (PerlIOSelf(f, PerlIOCrlf))->nl might be enough but
have not been able to build a conviction that it would work in all
cases).

The second patch contains a modified version of io_tell_encoding.t
to showcase the problem and test solutions on unix as well as Win32.

Opinions / corrections / tests / smokes / alternatives welcome :)

Regards,

--Christian

p5pRT · 2013-12-17T20:33:33Z

From cm.perl@abtela.com

0001-add-test-file-for-permissive-crlf-layer-see-120797.patch

From b769bca8110f69fbfbda61cfc9cf210e3cbc0b80 Mon Sep 17 00:00:00 2001
From: Christian Millour <cm.perl@abtela.com>
Date: Tue, 17 Dec 2013 20:34:45 +0100
Subject: [PATCH] add test file for permissive crlf layer (see #120797)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------------1.7.4"

This is a multi-part message in MIME format.
--------------1.7.4
Content-Type: text/plain; charset=UTF-8; format=fixed
Content-Transfer-Encoding: 8bit

---
 dist/IO/t/io_tell_encoding.t |   49 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 49 insertions(+), 0 deletions(-)
 create mode 100755 dist/IO/t/io_tell_encoding.t


--------------1.7.4
Content-Type: text/x-patch; name="0001-add-test-file-for-permissive-crlf-layer-see-120797.patch"
Content-Transfer-Encoding: 8bit
Content-Disposition: attachment; filename="0001-add-test-file-for-permissive-crlf-layer-see-120797.patch"

diff --git a/dist/IO/t/io_tell_encoding.t b/dist/IO/t/io_tell_encoding.t
new file mode 100755
index 0000000..5283e26
--- /dev/null
+++ b/dist/IO/t/io_tell_encoding.t
@@ -0,0 +1,49 @@
+#!./perl
+# Test permissive read of LF-delimited files on Win32 with encoding 
+# layers (see #120797).
+# This file should contain only ASCII and be stored LF-delimited to 
+# exhibit the potential problems.
+my @encodings;
+
+BEGIN {
+    require strict; import strict;
+    @encodings = (":crlf:encoding(UTF-8)", ":crlf:encoding(iso-8859-1)");
+}
+
+use Test;
+
+BEGIN { plan tests => 2*@encodings }
+
+use IO::File;
+
+my $File = __FILE__;
+my @lines = do {
+    open(my $f, "<", $File) or die $!;
+    <$f>;
+};
+
+sub test {
+    my ($encoding, $tell, $actualref) = @_;
+    study;
+    my $io = IO::File->new($File, "<$encoding") or die $!;
+    $$actualref = join ":", q{}, PerlIO::get_layers($io);
+    my $cnt = 0;
+    while (defined (my $line = $io->getline)) {
+	$line eq $lines[$cnt]
+	    or return "line $cnt, expected '$lines[$cnt]', got '$line'";
+	if ($tell) {
+	    () = tell $io;
+	}
+	++$cnt;
+    }
+    return "OK";
+}
+
+for my $tell (1, 0) {
+    for my $encoding (@encodings) {
+    	my $actual;
+	ok(test($encoding, $tell, \$actual), "OK", "encoding = $encoding, actual = $actual, tell = $tell");
+    }
+}
+#a0a1a2a3a4a5a6a7a8a9
+#b0b1b2b3b4b5b6b7b8b9

--------------1.7.4--

p5pRT · 2013-12-17T20:33:33Z

From cm.perl@abtela.com

0001-implement-permissive-crlf-layer-see-120797.patch

From 6d5e12cf2c83280bfea291f35006279dedf14496 Mon Sep 17 00:00:00 2001
From: Christian Millour <cm.perl@abtela.com>
Date: Tue, 17 Dec 2013 20:25:28 +0100
Subject: [PATCH] implement permissive :crlf layer (see #120797)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------------1.7.4"

This is a multi-part message in MIME format.
--------------1.7.4
Content-Type: text/plain; charset=UTF-8; format=fixed
Content-Transfer-Encoding: 8bit

---
 perlio.c  |   14 +++++++++++++-
 perliol.h |    1 +
 2 files changed, 14 insertions(+), 1 deletions(-)


--------------1.7.4
Content-Type: text/x-patch; name="0001-implement-permissive-crlf-layer-see-120797.patch"
Content-Transfer-Encoding: 8bit
Content-Disposition: attachment; filename="0001-implement-permissive-crlf-layer-see-120797.patch"

diff --git a/perlio.c b/perlio.c
index d89b9bb..5f028af 100644
--- a/perlio.c
+++ b/perlio.c
@@ -4477,9 +4477,18 @@ PERLIO_FUNCS_DECL(PerlIO_pending) = {
  * replaced by LF, or to the last CR of the buffer.  In the former case
  * the caller thinks that the buffer ends at c->nl + 1, in the latter
  * that it ends at c->nl; these two cases can be distinguished by
- * *c->nl.  c->nl is set during _getcnt() call, and unset during
+ * *c->nl.  c->nl is set during _get_cnt() call, and unset during
  * _unread() and _flush() calls.
+ *
  * It only matters for read operations.
+ *
+ * The flag PERLIO_F_CRLFSAWCR is used to allow "permissive" CRLF, for
+ * e.g. reading LF (unix) delimited files on Win32, the issue being to 
+ * prevent _unread() from translating back indiscriminately '\n' as
+ * a CR LF pair in that case (this is an issue because _unread() may
+ * be used quite a lot if there is an encoding(xxx) layer upstream).
+ * So the latter translation will be done only if this flag got set,
+ * which is done by _get_cnt() on finding a NATIVE_0xd.
  */
 
 typedef struct {
@@ -4538,6 +4547,8 @@ SSize_t
 PerlIOCrlf_unread(pTHX_ PerlIO *f, const void *vbuf, Size_t count)
 {
     PerlIOCrlf * const c = PerlIOSelf(f, PerlIOCrlf);
+    if (!(PerlIOBase(f)->flags & PERLIO_F_CRLF) || !(PerlIOBase(f)->flags & PERLIO_F_CRLFSAWCR))
+	return PerlIOBuf_unread(aTHX_ f, vbuf, count);
     if (c->nl) {	/* XXXX Shouldn't it be done only if b->ptr > c->nl? */
 	*(c->nl) = NATIVE_0xd;
 	c->nl = NULL;
@@ -4603,6 +4614,7 @@ PerlIOCrlf_get_cnt(pTHX_ PerlIO *f)
 	    while (nl < b->end && *nl != NATIVE_0xd)
 		nl++;
 	    if (nl < b->end && *nl == NATIVE_0xd) {
+                PerlIOBase(f)->flags |= PERLIO_F_CRLFSAWCR;
 	      test:
 		if (nl + 1 < b->end) {
 		    if (nl[1] == NATIVE_0xa) {
diff --git a/perliol.h b/perliol.h
index 82a3840..e26a331 100644
--- a/perliol.h
+++ b/perliol.h
@@ -91,6 +91,7 @@ struct _PerlIO {
 #define PERLIO_F_TTY		0x00800000
 #define PERLIO_F_NOTREG         0x01000000   
 #define PERLIO_F_CLEARED        0x02000000 /* layer cleared but not freed */
+#define PERLIO_F_CRLFSAWCR      0x04000000
 
 #define PerlIOBase(f)      (*(f))
 #define PerlIOSelf(f,type) ((type *)PerlIOBase(f))

--------------1.7.4--

p5pRT · 2013-12-19T00:59:22Z

From @Leont

On Tue, Dec 17, 2013 at 9:32 PM, Christian Millour <cm.perl@abtela.com>wrote:

Note that the problem is not limited to Win32. You get the same
misbehavior on unix when using "<:crlf:encoding(whatever)" on a
LF-delimited file.

What is really needed here is a "permissive" :crlf layer, i.e. one
that will allow reading either LF- or CRLF-delimited files. This is
more or less already the case in practice, unless you use an encoding
layer with :crlf. In that latter case, PerlIOEncode_flush() calls
PerlIO_unread(), which resolves to PerlIOCrlf_unread(), which
currently always translate back '\n' as a CR LF pair irrespective of
the original content, potentially thrashing the buffer in the process.

The attached tentative patch implements a form of autodetection of the
delimiter actually used in the stream. It involves a new PerlIO flag,
currently named (this is negotiable :)) PERLIO_F_CRLFSAWLF. This flag
is set by PerlIOCrlf_get_cnt() on finding a LF (actually a
NATIVE_0xd). PerlIOCrlf_unread() then does its specific work only if
the flag got set, and otherwise does a regulard PerlIOBuf_unread().
All bets are off though if the file being read uses both LF and CRLF.

This patch seems to work with blead on linux and Win32, at least as a
proof of concept. Dedicating an PerlIO flag for this might look as a
stiff price to pay but it keeps things simple (I thought for a time
that playing with (PerlIOSelf(f, PerlIOCrlf))->nl might be enough but
have not been able to build a conviction that it would work in all
cases).

The second patch contains a modified version of io_tell_encoding.t
to showcase the problem and test solutions on unix as well as Win32.

Opinions / corrections / tests / smokes / alternatives welcome :)

That whole method is an optimization anyway. I'm wondering if getting rid
of it wouldn't be a better solution. It makes ungetc less efficient though,
I'm not sure how often it gets used (I thought I had previously applied a
patch to make eof not use it by default, but it appears not). May want to
reduce the usage before applying this though.

Leon

p5pRT · 2013-12-19T00:59:22Z

From @Leont

0001-Use-more-naive-unreading-for-crlf.patch

From 509698a69731983ff66ac081beed45186e31b7f3 Mon Sep 17 00:00:00 2001
From: Leon Timmermans <fawaka@gmail.com>
Date: Thu, 19 Dec 2013 01:01:09 +0100
Subject: [PATCH] Use more naive unreading for crlf

---
 perlio.c |   57 +--------------------------------------------------------
 1 files changed, 1 insertions(+), 56 deletions(-)

diff --git a/perlio.c b/perlio.c
index d89b9bb..718b2b5 100644
--- a/perlio.c
+++ b/perlio.c
@@ -4533,61 +4533,6 @@ PerlIOCrlf_pushed(pTHX_ PerlIO *f, const char *mode, SV *arg, PerlIO_funcs *tab)
     return code;
 }
 
-
-SSize_t
-PerlIOCrlf_unread(pTHX_ PerlIO *f, const void *vbuf, Size_t count)
-{
-    PerlIOCrlf * const c = PerlIOSelf(f, PerlIOCrlf);
-    if (c->nl) {	/* XXXX Shouldn't it be done only if b->ptr > c->nl? */
-	*(c->nl) = NATIVE_0xd;
-	c->nl = NULL;
-    }
-    if (!(PerlIOBase(f)->flags & PERLIO_F_CRLF))
-	return PerlIOBuf_unread(aTHX_ f, vbuf, count);
-    else {
-	const STDCHAR *buf = (const STDCHAR *) vbuf + count;
-	PerlIOBuf *b = PerlIOSelf(f, PerlIOBuf);
-	SSize_t unread = 0;
-	if (PerlIOBase(f)->flags & PERLIO_F_WRBUF)
-	    PerlIO_flush(f);
-	if (!b->buf)
-	    PerlIO_get_base(f);
-	if (b->buf) {
-	    if (!(PerlIOBase(f)->flags & PERLIO_F_RDBUF)) {
-		b->end = b->ptr = b->buf + b->bufsiz;
-		PerlIOBase(f)->flags |= PERLIO_F_RDBUF;
-		b->posn -= b->bufsiz;
-	    }
-	    while (count > 0 && b->ptr > b->buf) {
-		const int ch = *--buf;
-		if (ch == '\n') {
-		    if (b->ptr - 2 >= b->buf) {
-			*--(b->ptr) = NATIVE_0xa;
-			*--(b->ptr) = NATIVE_0xd;
-			unread++;
-			count--;
-		    }
-		    else {
-		    /* If b->ptr - 1 == b->buf, we are undoing reading 0xa */
-                        *--(b->ptr) = NATIVE_0xa;   /* Works even if 0xa ==
-                                                       '\r' */
-			unread++;
-			count--;
-		    }
-		}
-		else {
-		    *--(b->ptr) = ch;
-		    unread++;
-		    count--;
-		}
-	    }
-	}
-        if (count > 0)
-            unread += PerlIOBase_unread(aTHX_ f, (const STDCHAR *) vbuf + unread, count);
-	return unread;
-    }
-}
-
 /* XXXX This code assumes that buffer size >=2, but does not check it... */
 SSize_t
 PerlIOCrlf_get_cnt(pTHX_ PerlIO *f)
@@ -4804,7 +4749,7 @@ PERLIO_FUNCS_DECL(PerlIO_crlf) = {
     PerlIOBase_fileno,
     PerlIOBuf_dup,
     PerlIOBuf_read,             /* generic read works with ptr/cnt lies */
-    PerlIOCrlf_unread,          /* Put CR,LF in buffer for each '\n' */
+    PerlIOBase_unread,          
     PerlIOCrlf_write,           /* Put CR,LF in buffer for each '\n' */
     PerlIOBuf_seek,
     PerlIOBuf_tell,
-- 
1.7.1

p5pRT · 2013-12-19T00:59:22Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2013-12-21T18:39:08Z

From @nwc10

On Thu, Dec 19, 2013 at 01:58:05AM +0100, Leon Timmermans wrote:

On Tue, Dec 17, 2013 at 9:32 PM, Christian Millour <cm.perl@abtela.com>wrote:

Given this:

All bets are off though if the file being read uses both LF and CRLF.

then:

That whole method is an optimization anyway. I'm wondering if getting rid

it's not much of an optimsiation if it breaks things.

of it wouldn't be a better solution. It makes ungetc less efficient though,
I'm not sure how often it gets used (I thought I had previously applied a
patch to make eof not use it by default, but it appears not). May want to
reduce the usage before applying this though.

Is there any way to gauge how often ungetc() is called?

[snip patch which removes a chunk of code]

I like the direction that your suggested patch is taking the PerlIO codebase.

Nicholas Clark

p5pRT · 2017-02-28T17:20:46Z

From @jkeenan

On Tue, 17 Dec 2013 20:33:33 GMT, cm.perl@abtela.com wrote:

Le 16/12/2013 02:36, Christian Millour (via RT) a écrit :

When you open a unix-delimited file (i.e., lines end in LF, not CRLF)
on Win32 with
my $io = new IO::File($filename, "<:encoding(...)")
a call to
tell $io;
seem to corrupt the handle / layers state to the point that the next
call to $io->getline does not return the next line as expected.

Note that the problem is not limited to Win32. You get the same
misbehavior on unix when using "<:crlf:encoding(whatever)" on a
LF-delimited file.

What is really needed here is a "permissive" :crlf layer, i.e. one
that will allow reading either LF- or CRLF-delimited files. This is
more or less already the case in practice, unless you use an encoding
layer with :crlf. In that latter case, PerlIOEncode_flush() calls
PerlIO_unread(), which resolves to PerlIOCrlf_unread(), which
currently always translate back '\n' as a CR LF pair irrespective of
the original content, potentially thrashing the buffer in the process.

The attached tentative patch implements a form of autodetection of the
delimiter actually used in the stream. It involves a new PerlIO flag,
currently named (this is negotiable :)) PERLIO_F_CRLFSAWLF. This flag
is set by PerlIOCrlf_get_cnt() on finding a LF (actually a
NATIVE_0xd). PerlIOCrlf_unread() then does its specific work only if
the flag got set, and otherwise does a regulard PerlIOBuf_unread().
All bets are off though if the file being read uses both LF and CRLF.

This patch seems to work with blead on linux and Win32, at least as a
proof of concept. Dedicating an PerlIO flag for this might look as a
stiff price to pay but it keeps things simple (I thought for a time
that playing with (PerlIOSelf(f, PerlIOCrlf))->nl might be enough but
have not been able to build a conviction that it would work in all
cases).

The second patch contains a modified version of io_tell_encoding.t
to showcase the problem and test solutions on unix as well as Win32.

Opinions / corrections / tests / smokes / alternatives welcome :)

To make this discussion more visible, I have created the following smoke branch:

smoke-me/jkeenan/120797-perlio

Regards,

--Christian

--
James E Keenan (jkeenan@cpan.org)

jkeenan · 2020-07-05T22:15:43Z

Mentioned on list recently, e.g., https://www.nntp.perl.org/group/perl.perl5.porters/2020/07/msg257916.html

Leont · 2020-07-06T18:01:47Z

Has anyone tried to run my patch? That would be helpful

p5pRT added hasPatch khw labels Oct 19, 2019

toddr removed the khw label Oct 25, 2019

jkeenan added the distro-mswin32 label Jul 5, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tell / getline problems on Win32 with unix-delimited files opened with encoding(UTF-8) #13471

tell / getline problems on Win32 with unix-delimited files opened with encoding(UTF-8) #13471

p5pRT commented Dec 16, 2013

p5pRT commented Dec 16, 2013

p5pRT commented Dec 16, 2013

p5pRT commented Dec 17, 2013

p5pRT commented Dec 17, 2013

p5pRT commented Dec 17, 2013

p5pRT commented Dec 19, 2013

p5pRT commented Dec 19, 2013

p5pRT commented Dec 19, 2013

p5pRT commented Dec 21, 2013

p5pRT commented Feb 28, 2017

jkeenan commented Jul 5, 2020

Leont commented Jul 6, 2020

tell / getline problems on Win32 with unix-delimited files opened with encoding(UTF-8) #13471

tell / getline problems on Win32 with unix-delimited files opened with encoding(UTF-8) #13471

Comments

p5pRT commented Dec 16, 2013

p5pRT commented Dec 16, 2013

From cm.perl@abtela.com

p5pRT commented Dec 16, 2013

From cm.perl@abtela.com

p5pRT commented Dec 17, 2013

From cm.perl@abtela.com

p5pRT commented Dec 17, 2013

From cm.perl@abtela.com

p5pRT commented Dec 17, 2013

From cm.perl@abtela.com

p5pRT commented Dec 19, 2013

From @Leont

p5pRT commented Dec 19, 2013

From @Leont

p5pRT commented Dec 19, 2013

p5pRT commented Dec 21, 2013

From @nwc10

p5pRT commented Feb 28, 2017

From @jkeenan

jkeenan commented Jul 5, 2020

Leont commented Jul 6, 2020