New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tell / getline problems on Win32 with unix-delimited files opened with encoding(UTF-8) #13471
Comments
From cm.perl@abtela.comThis is a bug report for perl from cm.perl@abtela.com, When you open a unix-delimited file (i.e., lines end in LF, not CRLF) This is a serious problem as it precludes any use of This seems to be the reason why Pod-Eventual-0.094001 fails tests Dist-Zilla, Pod-Weaver, Config-INI and other important CPAN distribs The attached test file io_tell_encoding.t illustrates the problem What this test program does is first establish a "reference" version my $io = IO::File->new($File, "<:$encoding") or die $!; for the following values of $encoding : "encoding(UTF-8)", "encoding(iso-8859-1)", "", "raw", "crlf", "utf8" any of which should be able to read without problem a pure ASCII, In a first batch of tests there is a call to tell($io) after each The test file is the test program itself (the comments at the end of When stored as a unix (LF delimited) file, this program yields Taisha:~/devbin/tmp $ perl io_tell_encoding.t We see that the test fails only for "encoding(...)" when tell($io) If the program is stored as a CRLF delimited file it yields instead Taisha:~/devbin/tmp $ perl io_tell_encoding.t now the only encoding that fails is ':raw', which is normal and I have tried to investigate further but after a few hours concluded Thank you for your time and attention. Flags: Site configuration information for perl 5.18.1: Configured by strawberry-perl at Tue Aug 13 19:21:46 2013. Summary of my perl5 (revision 5 version 18 subversion 1) configuration: Platform: Locally applied patches: @INC for perl 5.18.1: Environment for perl 5.18.1: PATH=E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\perl\site\bin;E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\perl\bin;E:\cm\devbin\strawberry-perl-5.18.1.1-32bit-portable\c\bin;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program |
From cm.perl@abtela.com#!./perl BEGIN { @encodings = ("encoding(UTF-8)", "encoding(iso-8859-1)", "", "raw", "crlf", "utf8"); use Test; BEGIN { plan tests => 2*@encodings } use IO::File; sub test { for my $tell (1, 0) { |
From cm.perl@abtela.comLe 16/12/2013 02:36, Christian Millour (via RT) a écrit :
Note that the problem is not limited to Win32. You get the same What is really needed here is a "permissive" :crlf layer, i.e. one The attached tentative patch implements a form of autodetection of the This patch seems to work with blead on linux and Win32, at least as a The second patch contains a modified version of io_tell_encoding.t Opinions / corrections / tests / smokes / alternatives welcome :) Regards, --Christian |
From cm.perl@abtela.com0001-add-test-file-for-permissive-crlf-layer-see-120797.patchFrom b769bca8110f69fbfbda61cfc9cf210e3cbc0b80 Mon Sep 17 00:00:00 2001
From: Christian Millour <cm.perl@abtela.com>
Date: Tue, 17 Dec 2013 20:34:45 +0100
Subject: [PATCH] add test file for permissive crlf layer (see #120797)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------------1.7.4"
This is a multi-part message in MIME format.
--------------1.7.4
Content-Type: text/plain; charset=UTF-8; format=fixed
Content-Transfer-Encoding: 8bit
---
dist/IO/t/io_tell_encoding.t | 49 ++++++++++++++++++++++++++++++++++++++++++
1 files changed, 49 insertions(+), 0 deletions(-)
create mode 100755 dist/IO/t/io_tell_encoding.t
--------------1.7.4
Content-Type: text/x-patch; name="0001-add-test-file-for-permissive-crlf-layer-see-120797.patch"
Content-Transfer-Encoding: 8bit
Content-Disposition: attachment; filename="0001-add-test-file-for-permissive-crlf-layer-see-120797.patch"
diff --git a/dist/IO/t/io_tell_encoding.t b/dist/IO/t/io_tell_encoding.t
new file mode 100755
index 0000000..5283e26
--- /dev/null
+++ b/dist/IO/t/io_tell_encoding.t
@@ -0,0 +1,49 @@
+#!./perl
+# Test permissive read of LF-delimited files on Win32 with encoding
+# layers (see #120797).
+# This file should contain only ASCII and be stored LF-delimited to
+# exhibit the potential problems.
+my @encodings;
+
+BEGIN {
+ require strict; import strict;
+ @encodings = (":crlf:encoding(UTF-8)", ":crlf:encoding(iso-8859-1)");
+}
+
+use Test;
+
+BEGIN { plan tests => 2*@encodings }
+
+use IO::File;
+
+my $File = __FILE__;
+my @lines = do {
+ open(my $f, "<", $File) or die $!;
+ <$f>;
+};
+
+sub test {
+ my ($encoding, $tell, $actualref) = @_;
+ study;
+ my $io = IO::File->new($File, "<$encoding") or die $!;
+ $$actualref = join ":", q{}, PerlIO::get_layers($io);
+ my $cnt = 0;
+ while (defined (my $line = $io->getline)) {
+ $line eq $lines[$cnt]
+ or return "line $cnt, expected '$lines[$cnt]', got '$line'";
+ if ($tell) {
+ () = tell $io;
+ }
+ ++$cnt;
+ }
+ return "OK";
+}
+
+for my $tell (1, 0) {
+ for my $encoding (@encodings) {
+ my $actual;
+ ok(test($encoding, $tell, \$actual), "OK", "encoding = $encoding, actual = $actual, tell = $tell");
+ }
+}
+#a0a1a2a3a4a5a6a7a8a9
+#b0b1b2b3b4b5b6b7b8b9
--------------1.7.4--
|
From cm.perl@abtela.com0001-implement-permissive-crlf-layer-see-120797.patchFrom 6d5e12cf2c83280bfea291f35006279dedf14496 Mon Sep 17 00:00:00 2001
From: Christian Millour <cm.perl@abtela.com>
Date: Tue, 17 Dec 2013 20:25:28 +0100
Subject: [PATCH] implement permissive :crlf layer (see #120797)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------------1.7.4"
This is a multi-part message in MIME format.
--------------1.7.4
Content-Type: text/plain; charset=UTF-8; format=fixed
Content-Transfer-Encoding: 8bit
---
perlio.c | 14 +++++++++++++-
perliol.h | 1 +
2 files changed, 14 insertions(+), 1 deletions(-)
--------------1.7.4
Content-Type: text/x-patch; name="0001-implement-permissive-crlf-layer-see-120797.patch"
Content-Transfer-Encoding: 8bit
Content-Disposition: attachment; filename="0001-implement-permissive-crlf-layer-see-120797.patch"
diff --git a/perlio.c b/perlio.c
index d89b9bb..5f028af 100644
--- a/perlio.c
+++ b/perlio.c
@@ -4477,9 +4477,18 @@ PERLIO_FUNCS_DECL(PerlIO_pending) = {
* replaced by LF, or to the last CR of the buffer. In the former case
* the caller thinks that the buffer ends at c->nl + 1, in the latter
* that it ends at c->nl; these two cases can be distinguished by
- * *c->nl. c->nl is set during _getcnt() call, and unset during
+ * *c->nl. c->nl is set during _get_cnt() call, and unset during
* _unread() and _flush() calls.
+ *
* It only matters for read operations.
+ *
+ * The flag PERLIO_F_CRLFSAWCR is used to allow "permissive" CRLF, for
+ * e.g. reading LF (unix) delimited files on Win32, the issue being to
+ * prevent _unread() from translating back indiscriminately '\n' as
+ * a CR LF pair in that case (this is an issue because _unread() may
+ * be used quite a lot if there is an encoding(xxx) layer upstream).
+ * So the latter translation will be done only if this flag got set,
+ * which is done by _get_cnt() on finding a NATIVE_0xd.
*/
typedef struct {
@@ -4538,6 +4547,8 @@ SSize_t
PerlIOCrlf_unread(pTHX_ PerlIO *f, const void *vbuf, Size_t count)
{
PerlIOCrlf * const c = PerlIOSelf(f, PerlIOCrlf);
+ if (!(PerlIOBase(f)->flags & PERLIO_F_CRLF) || !(PerlIOBase(f)->flags & PERLIO_F_CRLFSAWCR))
+ return PerlIOBuf_unread(aTHX_ f, vbuf, count);
if (c->nl) { /* XXXX Shouldn't it be done only if b->ptr > c->nl? */
*(c->nl) = NATIVE_0xd;
c->nl = NULL;
@@ -4603,6 +4614,7 @@ PerlIOCrlf_get_cnt(pTHX_ PerlIO *f)
while (nl < b->end && *nl != NATIVE_0xd)
nl++;
if (nl < b->end && *nl == NATIVE_0xd) {
+ PerlIOBase(f)->flags |= PERLIO_F_CRLFSAWCR;
test:
if (nl + 1 < b->end) {
if (nl[1] == NATIVE_0xa) {
diff --git a/perliol.h b/perliol.h
index 82a3840..e26a331 100644
--- a/perliol.h
+++ b/perliol.h
@@ -91,6 +91,7 @@ struct _PerlIO {
#define PERLIO_F_TTY 0x00800000
#define PERLIO_F_NOTREG 0x01000000
#define PERLIO_F_CLEARED 0x02000000 /* layer cleared but not freed */
+#define PERLIO_F_CRLFSAWCR 0x04000000
#define PerlIOBase(f) (*(f))
#define PerlIOSelf(f,type) ((type *)PerlIOBase(f))
--------------1.7.4--
|
From @LeontOn Tue, Dec 17, 2013 at 9:32 PM, Christian Millour <cm.perl@abtela.com>wrote:
That whole method is an optimization anyway. I'm wondering if getting rid Leon |
From @Leont0001-Use-more-naive-unreading-for-crlf.patchFrom 509698a69731983ff66ac081beed45186e31b7f3 Mon Sep 17 00:00:00 2001
From: Leon Timmermans <fawaka@gmail.com>
Date: Thu, 19 Dec 2013 01:01:09 +0100
Subject: [PATCH] Use more naive unreading for crlf
---
perlio.c | 57 +--------------------------------------------------------
1 files changed, 1 insertions(+), 56 deletions(-)
diff --git a/perlio.c b/perlio.c
index d89b9bb..718b2b5 100644
--- a/perlio.c
+++ b/perlio.c
@@ -4533,61 +4533,6 @@ PerlIOCrlf_pushed(pTHX_ PerlIO *f, const char *mode, SV *arg, PerlIO_funcs *tab)
return code;
}
-
-SSize_t
-PerlIOCrlf_unread(pTHX_ PerlIO *f, const void *vbuf, Size_t count)
-{
- PerlIOCrlf * const c = PerlIOSelf(f, PerlIOCrlf);
- if (c->nl) { /* XXXX Shouldn't it be done only if b->ptr > c->nl? */
- *(c->nl) = NATIVE_0xd;
- c->nl = NULL;
- }
- if (!(PerlIOBase(f)->flags & PERLIO_F_CRLF))
- return PerlIOBuf_unread(aTHX_ f, vbuf, count);
- else {
- const STDCHAR *buf = (const STDCHAR *) vbuf + count;
- PerlIOBuf *b = PerlIOSelf(f, PerlIOBuf);
- SSize_t unread = 0;
- if (PerlIOBase(f)->flags & PERLIO_F_WRBUF)
- PerlIO_flush(f);
- if (!b->buf)
- PerlIO_get_base(f);
- if (b->buf) {
- if (!(PerlIOBase(f)->flags & PERLIO_F_RDBUF)) {
- b->end = b->ptr = b->buf + b->bufsiz;
- PerlIOBase(f)->flags |= PERLIO_F_RDBUF;
- b->posn -= b->bufsiz;
- }
- while (count > 0 && b->ptr > b->buf) {
- const int ch = *--buf;
- if (ch == '\n') {
- if (b->ptr - 2 >= b->buf) {
- *--(b->ptr) = NATIVE_0xa;
- *--(b->ptr) = NATIVE_0xd;
- unread++;
- count--;
- }
- else {
- /* If b->ptr - 1 == b->buf, we are undoing reading 0xa */
- *--(b->ptr) = NATIVE_0xa; /* Works even if 0xa ==
- '\r' */
- unread++;
- count--;
- }
- }
- else {
- *--(b->ptr) = ch;
- unread++;
- count--;
- }
- }
- }
- if (count > 0)
- unread += PerlIOBase_unread(aTHX_ f, (const STDCHAR *) vbuf + unread, count);
- return unread;
- }
-}
-
/* XXXX This code assumes that buffer size >=2, but does not check it... */
SSize_t
PerlIOCrlf_get_cnt(pTHX_ PerlIO *f)
@@ -4804,7 +4749,7 @@ PERLIO_FUNCS_DECL(PerlIO_crlf) = {
PerlIOBase_fileno,
PerlIOBuf_dup,
PerlIOBuf_read, /* generic read works with ptr/cnt lies */
- PerlIOCrlf_unread, /* Put CR,LF in buffer for each '\n' */
+ PerlIOBase_unread,
PerlIOCrlf_write, /* Put CR,LF in buffer for each '\n' */
PerlIOBuf_seek,
PerlIOBuf_tell,
--
1.7.1
|
The RT System itself - Status changed from 'new' to 'open' |
From @nwc10On Thu, Dec 19, 2013 at 01:58:05AM +0100, Leon Timmermans wrote:
Given this:
then:
it's not much of an optimsiation if it breaks things.
Is there any way to gauge how often ungetc() is called? [snip patch which removes a chunk of code] I like the direction that your suggested patch is taking the PerlIO codebase. Nicholas Clark |
From @jkeenanOn Tue, 17 Dec 2013 20:33:33 GMT, cm.perl@abtela.com wrote:
To make this discussion more visible, I have created the following smoke branch: smoke-me/jkeenan/120797-perlio
-- |
Mentioned on list recently, e.g., https://www.nntp.perl.org/group/perl.perl5.porters/2020/07/msg257916.html |
Has anyone tried to run my patch? That would be helpful |
Migrated from rt.perl.org#120797 (status was 'open')
Searchable as RT120797$
The text was updated successfully, but these errors were encountered: