chomp() can be confusing #876

Closed
p5pRT opened this issue Nov 19, 1999 · 29 comments

p5pRT commented Nov 19, 1999

Migrated from rt.perl.org#1807 (status was 'rejected')

Searchable as RT1807$

p5pRT commented Nov 19, 1999

From Ben_Tilly@trepp.com

Is there any possibility of having chomp() be modified to recognize \n, \r,
and \r\n as line-endings to chomp? A source of nasty confusion for people
working in a cross-platform environment is when the same exact script gives
very different results on the same exact file depending on whether you are
running under *nix or Windows. (Particularly an issue with Samba because
people wind up reading under one system files created under the other.)

Yes, the current behaviour works as documented. But it leads to code not
doing what people expect, and a confused person can easily spend several
hours confused...

Cheers,
Ben
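
A minimal sketch of the symptom described above, assuming a CRLF-terminated line has reached the program unchanged (as happens when a DOS-format file is read on Unix). The sample string and variable names are illustrative only.

```perl
use strict;
use warnings;

# A line as it arrives when a CRLF-terminated file is read
# with the record separator $/ set to "\n".
my $line = "hello\r\n";

{
    local $/ = "\n";    # the default value
    chomp $line;        # removes only the trailing "\n"
}

# The carriage return survives, so an exact comparison fails.
my $still_has_cr = ($line =~ /\r\z/)  ? 1 : 0;
my $matches      = ($line eq "hello") ? 1 : 0;

printf "length=%d has_cr=%d equals_hello=%d\n",
    length($line), $still_has_cr, $matches;
```

The leftover "\r" is invisible in most terminals, which is why the mismatch tends to take a while to track down.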

p5pRT commented Nov 19, 1999

From Ben_Tilly@trepp.com

Is there any possibility of having Perl's chomp() command be modified to
recognize \n, \r, and \r\n as line-endings to chomp? A source of nasty
confusion for people working in a cross-platform environment is when
identical Perl scripts give very different results on the same exact file
depending on whether you are running under *nix or Windows. (Particularly
an issue with Samba because people wind up reading under one system files
created under the other.)

Yes, the current behaviour works as documented. But it leads to code not
doing what people expect, and in many cases a confused person will spend
several hours confused...

Cheers,
Ben

p5pRT commented Nov 19, 1999

From [Unknown Contact. See original ticket]

In message <OF47CC4F38.CF9B333E-ON8525682E.0054550B@trepp.com>,
  Ben_Tilly@trepp.com writes:

: Is there any possibility of having Perl's chomp() command be modified to
: recognize \n, \r, and \r\n as line-endings to chomp? A source of nasty
: confusion for people working in a cross-platform environment is when
: identical Perl scripts give very different results on the same exact file
: depending on whether you are running under *nix or Windows. (Particularly
: an issue with Samba because people wind up reading under one system files
: created under the other.)

Have a look at <URL:http://language.perl.com/ppt/src/nlcvt/nlcvt> for
an example of how to do what you want (or at least something similar).

Greg

p5pRT commented Nov 19, 1999

From [Unknown Contact. See original ticket]

>In message <OF47CC4F38.CF9B333E-ON8525682E.0054550B@trepp.com>,
>Ben_Tilly@trepp.com writes:
>
>: Is there any possibility of having Perl's chomp() command be modified to
>: recognize \n, \r, and \r\n as line-endings to chomp? A source of nasty
>: confusion for people working in a cross-platform environment is when
>: identical Perl scripts give very different results on the same exact file
>: depending on whether you are running under *nix or Windows. (Particularly
>: an issue with Samba because people wind up reading under one system files
>: created under the other.)
>
>Have a look at <URL:http://language.perl.com/ppt/src/nlcvt/nlcvt> for
>an example of how to do what you want (or at least something similar).

When I said, "I talked to" I meant that. I don't need the pointer - I know
how to handle it. But I wind up answering questions from people who do
not. In my experience most of them use chomp() so the confusion is
preventable. (After all you expect chomp() to get rid of line endings,
right?)

Cheers,
Ben

p5pRT commented Nov 19, 1999

From [Unknown Contact. See original ticket]

>When I said, "I talked to" I meant that. I don't need the pointer - I know
>how to handle it. But I wind up answering questions from people who do
>not. In my experience most of them use chomp() so the confusion is
>preventable. (After all you expect chomp() to get rid of line endings,
>right?)

I expect chomp() to remove one and only one terminating instance of the
precise string to which $/ has been set; no more, no less. What were
you expecting?

--tom
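
The documented behaviour referred to here can be checked with a short sketch. The strings are made-up examples; chomp() returns the number of characters it removed.

```perl
use strict;
use warnings;

my $s = "record\n\n";
my $removed;
{
    local $/ = "\n";
    $removed = chomp $s;      # strips exactly one "\n"; $s is now "record\n"
}

my $t = "fieldEND";
my $removed_t;
{
    local $/ = "END";         # any string may serve as the separator
    $removed_t = chomp $t;    # strips "END"; $t is now "field"
}

print "removed $removed and $removed_t characters\n";
```

One documented exception is paragraph mode (`$/ = ""`), where chomp() removes all trailing newlines.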

p5pRT commented Nov 19, 1999

From @pudge

At 10.22 -0500 1999.11.19, Ben_Tilly@trepp.com wrote:

>I have just talked to one too many people who have been bitten by this...
>
>Is there any possibility of having Perl's chomp() command be modified to
>recognize \n, \r, and \r\n as line-endings to chomp? A source of nasty
>confusion for people working in a cross-platform environment is when
>identical Perl scripts give very different results on the same exact file
>depending on whether you are running under *nix or Windows. (Particularly
>an issue with Samba because people wind up reading under one system files
>created under the other.)
>
>Yes, the current behaviour works as documented. But it leads to code not
>doing what people expect, and in many cases a confused person will spend
>several hours confused...

Well, these same people will also have a problem with readline and <>. And
I certainly don't want behavior where readline and chomp treat different
things as record separators.

I would like to see, perhaps, a regex IRS, so you could do​:

  $/ = qr/(?:\015\012?|\012)/;

or whatever. Of course, that is flawed, in that it won't catch the special
(usually broken) case of a file having CR, LF, or CRLF mixed in the same
file. Oh well.
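
Since `$/` holds a plain string rather than a pattern, the closest runnable approximation is to apply the same regex with split() to data that has already been read. The following is only a sketch under that assumption; `split_records` is a hypothetical helper name, not a core function.

```perl
use strict;
use warnings;

# Hypothetical helper: split data that is already in memory into
# records on CRLF, a lone CR, or a lone LF.
sub split_records {
    my ($data) = @_;
    my @records = split /\015\012?|\012/, $data, -1;
    # A final terminator leaves one empty trailing field; drop it.
    pop @records if @records && $records[-1] eq '';
    return @records;
}

my @r = split_records("one\015\012two\012three\015four\012");
print join('|', @r), "\n";
```

Because the whole input is in memory, this approach sidesteps the streaming problems that a real regex record separator would have to solve.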

Another solution would involve per-filehandle IRS, where you could call a
function (say, textmode()) that would inspect the filehandle and set the
IRS appropriately for that filehandle. This is more subject to failure for
sockets, though, because it would involve reading, looking at the data, and
then seeking back to the beginning.
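
A rough sketch of the sniff-and-rewind idea, assuming a seekable filehandle. The name `sniff_eol` and the 4096-byte peek size are assumptions made for illustration; nothing like it exists in core Perl.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: peek at the start of a seekable handle, report
# the first line ending found, then rewind to where we started.
sub sniff_eol {
    my ($fh) = @_;
    my $pos = tell $fh;
    my $buf = '';
    read $fh, $buf, 4096;
    seek $fh, $pos, 0 or die "handle is not seekable: $!";
    return $buf =~ /(\015\012|\015|\012)/ ? $1 : "\n";
}

# Build a CRLF-terminated sample file byte for byte.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "alpha\015\012beta\015\012";
close $out or die "close: $!";

open my $in, '<', $path or die "open: $!";
binmode $in;
my $eol = sniff_eol($in);

my @lines;
{
    local $/ = $eol;            # readline and chomp now agree
    chomp(@lines = <$in>);
}
close $in;
```

The sketch shares the weaknesses named above: it needs seek(), so it cannot work on sockets or pipes, and it would misreport CRLF as CR if the peek happened to end between the two bytes.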

I have a prototype of something that tied filehandles to do this, but it
fails with sockets, and doesn't do anything for chomp() anyway (and didn't
work great anyway because of some flaws in tied filehandles and prototypes
... I was using 5.004, I don't know if the flaws have been fixed or
whatnot).

--
Chris Nandor mailto:pudge@pobox.com http://pudge.net/
%PGPKey = ('B76E72AD', [1024, '0824090B CE73CA10 1FF77F13 8180B6B6'])

p5pRT commented Nov 19, 1999

From [Unknown Contact. See original ticket]

Ben_Tilly@trepp.com wrote

>Is there any possibility of having Perl's chomp() command be modified to
>recognize \n, \r, and \r\n as line-endings to chomp?

I hope not. chomp() should match the string in $?, no more or less.

Your problem is not with chomp(). Rather it is with the I/O subsystem.
If you are reading a file as a newline-terminated text file, then
what your Perl code should see is "\n" and nothing else.

You can achieve this with tied filehandles, but I understand that
isn't what you're looking for.

I think the "right" way of doing this is by providing some sort of
filter apparatus on files. Things of this sort were discussed in
the context of unicode. I don't recall where (if anywhere) that
ended.
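
For readers coming to this thread later: Perl 5.8 and newer provide PerlIO layers, which are one form of the filter apparatus described here. The sketch below assumes such a Perl and uses the `:crlf` layer; the temporary file and its contents are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a CRLF-terminated file byte for byte.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "first\015\012second\015\012";
close $out or die "close: $!";

# Reopen it with the :crlf layer, which maps CRLF to "\n" on input.
open my $in, '<:crlf', $path or die "open: $!";
my @lines = <$in>;
close $in;

chomp @lines;    # $/ is still the default "\n"
print join('|', @lines), "\n";
```

The `:crlf` layer translates only the CR LF pair; a file terminated by lone CR characters would still need separate handling.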

Mike Guy

p5pRT commented Nov 19, 1999

From [Unknown Contact. See original ticket]

>I think the "right" way of doing this is by providing some sort of
>filter apparatus on files.

Well, that makes two of us.

>Things of this sort were discussed in
>the context of unicode. I don't recall where (if anywhere) that
>ended.

Check mjd's summaries?

--tom

p5pRT commented Nov 19, 1999

From [Unknown Contact. See original ticket]

I wrote

I hope not. chomp() should match the string in $?, no more or less.
  $/

Damn shift key.

Mike Guy

p5pRT commented Nov 19, 1999

From @TimToady

Tom Christiansen writes:
: >When I said, "I talked to" I meant that. I don't need the pointer - I know
: >how to handle it. But I wind up answering questions from people who do
: >not. In my experience most of them use chomp() so the confusion is
: >preventable. (After all you expect chomp() to get rid of line endings,
: >right?)
:
: I expect chomp() to remove one and only one terminating instance of the
: precise string to which $/ has been set; no more, no less. What were
: you expecting?

I expect people to expect Perl to do the right thing.

Larry

p5pRT commented Nov 19, 1999

From @TimToady

M.J.T. Guy writes:
: I think the "right" way of doing this is by providing some sort of
: filter apparatus on files. Things of this sort were discussed in
: the context of unicode. I don't recall where (if anywhere) that
: ended.

Ended? It hasn't started yet...

(Can you tell I've spent too much time rewriting the Camel book today? :-)

Yes, input filters should handle this. And a good case can be made
that the *default* input filter should handle it, along with UTF-8
recognition. It also has to be blazing fast, of course, along with
reading your mind. But that's Perl for the coarse. Or something like
that.

Larry

p5pRT commented Nov 19, 1999

From [Unknown Contact. See original ticket]

>: I expect chomp() to remove one and only one terminating instance of the
>: precise string to which $/ has been set; no more, no less. What were
>: you expecting?

>I expect people to expect Perl to do the right thing.

And that would be what, sniff around the stdio buffer the first time you
play with it and figure out what it smells like?

--tom

p5pRT commented Nov 19, 1999

From @TimToady

Tom Christiansen writes:
: >: I expect chomp() to remove one and only one terminating instance of the
: >: precise string to which $/ has been set; no more, no less. What were
: >: you expecting?
:
: >I expect people to expect Perl to do the right thing.
:
: And that would be what, sniff around the stdio buffer the first time you
: play with it and figure out what it smells like?

Why do you say "you"? Did I say I expect Perl to do the right thing? :-)

Seriously, we are entering an era when dwimmerly action on input will
be a necessary evil. I could wish it were otherwise, but my supply of
divine fiats is low. And I don't think anyone else has enough fiats to
pull it off either. For the near future I only see a chaotic dance
around the UTF-8 strange attractor, in part because a lot of butterflies
are flapping their wings near the UTF-16 attractor instead. We're going
to live in interesting times, whether or not that's an ancient Chinese
curse.

Larry

p5pRT commented Nov 21, 1999

From [Unknown Contact. See original ticket]

Tom Christiansen <tchrist@jhereg.perl.com> writes:

>>: I expect chomp() to remove one and only one terminating instance of the
>>: precise string to which $/ has been set; no more, no less. What were
>>: you expecting?
>
>>I expect people to expect Perl to do the right thing.
>
>And that would be what, sniff around the stdio buffer the first time you
>play with it and figure out what it smells like?

That is far from daft. sv_gets() (the internals of readline) would know
what it had used to find the end of the line. It could leave the
information around for chomp to use.

But the "right thing" is
just to return \n as a logical newline however it was represented in the
buffer (unless in binmode of course). Then chomp'ing \n is fine.

>--tom
--
Nick Ing-Simmons

p5pRT commented Nov 21, 1999

From [Unknown Contact. See original ticket]

Nick Ing-Simmons <nick@ing-simmons.net> writes:
|| Tom Christiansen <tchrist@jhereg.perl.com> writes:
|| >>: I expect chomp() to remove one and only one terminating instance of the
|| >>: precise string to which $/ has been set; no more, no less. What were
|| >>: you expecting?
|| >
|| >>I expect people to expect Perl to do the right thing.
|| >
|| >And that would be what, sniff around the stdio buffer the first time you
|| >play with it and figure out what it smells like?
||
|| That is far from daft. sv_gets() (the internals of readline) would know
|| what it had used to find the end of the line. It could leave the
|| information around for chomp to use.

But as soon as you have a program that opens multiple files of
differing formats, this breaks down. You end up with taint-like
tracing of strings to track which form of file each string came
from.

Which means that...

|| But the "right thing" is
|| just to return \n as a logical newline however it was represented in the
|| buffer (unless in binmode of course). Then chomp'ing \n is fine.

is really a much better choice. That just leaves the issue of
determining the right filtering to do for an output file so that
it matches the input file it is derived from or the target it is
being written to or whatever is the most significant issue -
which the programmer will have to deal with.

--
John Macdonald jmm@jmm.pickering.elegant.com

p5pRT commented Nov 21, 1999

From @mjdominus

>>Things of this sort were discussed in the context of unicode. I
>>don't recall where (if anywhere) that ended.
>
>Check mjd's summaries?

http://www.perl.com/pub/1999/11/p5pdigest/THISWEEK-19991114.html#More_About_Line_Disciplines

http://www.perl.com/pub/1999/11/p5pdigest/THISWEEK-19991107.html#Record_Separators_that_Contain_NUL

My summary of the summaries:

1. Larry said it would be important to have `line disciplines'
  settable on filehandles, and that it would be important for the
  default ones to be fast.

2. Sam Tregar said he would do it, but I don't know if he will.

3. This is the third week in a row that it has cropped up.

p5pRT commented Nov 22, 1999

From [Unknown Contact. See original ticket]

John Macdonald <jmm@elegant.com> writes:

>||
>|| That is far from daft. sv_gets() (the internals of readline) would know
>|| what it had used to find the end of the line. It could leave the
>|| information around for chomp to use.
>
>But as soon as you have a program that opens multiple files of
>differing formats, this breaks down. You end up with taint-like
>tracing of strings to track which form of file each string came
>from.

Yes, the EOLN string would have to be annotated on the SV somewhere
presumably as "magic". Having chomp look for "EOLN magic" on the SV
would be easy to do. The 'set' part of the magic would clear the field.

>Which means that...
>
>|| But the "right thing" is
>|| just to return \n as a logical newline however it was represented in the
>|| buffer (unless in binmode of course). Then chomp'ing \n is fine.
>
>is really a much better choice.

I know ;-) I am delinquent in implementing it.

>That just leaves the issue of
>determining the right filtering to do for an output file so that
>it matches the input file it is derived from or the target it is
>being written to or whatever is the most significant issue -
>which the programmer will have to deal with.
--
Nick Ing-Simmons <nik@tiuk.ti.com>
Via, but not speaking for: Texas Instruments Ltd.

p5pRT commented Nov 22, 1999

From [Unknown Contact. See original ticket]

Nick Ing-Simmons wrote :
|| John Macdonald <jmm@elegant.com> writes:
|| >||
|| >|| That is far from daft. sv_gets() (the internals of readline) would know
|| >|| what it had used to find the end of the line. It could leave the
|| >|| information around for chomp to use.
|| >
|| >But as soon as you have a program that opens multiple files of
|| >differing formats, this breaks down. You end up with taint-like
|| >tracing of strings to track which form of file each string came
|| >from.
||
|| Yes, the EOLN string would have to be annotated on the SV somewhere
|| presumably as "magic". Having chomp look for "EOLN magic" on the SV
|| would be easy to do. The 'set' part of the magic would clear the field.

It gets messier...

  $para = "$file1_lines$file2_lines$file3_lines";

Which of the three EOLN magics gets assigned to $para?

|| >Which means that...
|| >
|| >|| But the "right thing" is
|| >|| just to return \n as a logical newline however it was represented in the
|| >|| buffer (unless in binmode of course). Then chomp'ing \n is fine.
|| >
|| >is really a much better choice.
||
|| I know ;-) I am delinquent in implementing it.
||
|| >That just leaves the issue of
|| >determining the right filtering to do for an output file so that
|| >it matches the input file it is derived from or the target it is
|| >being written to or whatever is the most significant issue -
|| >which the programmer will have to deal with.

--
objects: | John Macdonald
  Think of them as data with an attitude. | jmm@elegant.com

p5pRT commented Nov 22, 1999

From [Unknown Contact. See original ticket]

>Tom Christiansen writes:
>: >When I said, "I talked to" I meant that. I don't need the pointer - I know
>: >how to handle it. But I wind up answering questions from people who do
>: >not. In my experience most of them use chomp() so the confusion is
>: >preventable. (After all you expect chomp() to get rid of line endings,
>: >right?)
>:
>: I expect chomp() to remove one and only one terminating instance of the
>: precise string to which $/ has been set; no more, no less. What were
>: you expecting?

I consider $/ the mechanism through which you can change "line ending" to "paragraph ending" etc...

>I expect people to expect Perl to do the right thing.

Ah yes, now I remember why I love this language. :-)

The suggestion that I heard which I most like is letting $/ be an RE. So
you can make $/ be /\n|\r\n?/ and it magically "does the right thing" on
virtually any file. However integrating this logic into the RE engine
could be interesting. After all if you apply the pattern I gave to a
string that ends with \r, it will match even if the next character to be
read is \n. This is manifestly not the right thing to do. :-(
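
The boundary problem can be reproduced without any I/O. This sketch assumes the input arrives in arbitrary chunks and applies the pattern to each chunk independently; the chunk contents and the helper name `count_separators` are made up for illustration.

```perl
use strict;
use warnings;

my $sep = qr/\n|\r\n?/;

# The same stream, "a\r\nb\n", delivered as one chunk and as two.
my @whole   = ("a\r\nb\n");
my @chunked = ("a\r", "\nb\n");

# Naive approach: count separator matches in each chunk on its own.
sub count_separators {
    my @chunks = @_;
    my $count  = 0;
    for my $chunk (@chunks) {
        $count++ while $chunk =~ /$sep/g;
    }
    return $count;
}

my $n_whole   = count_separators(@whole);
my $n_chunked = count_separators(@chunked);
print "whole=$n_whole chunked=$n_chunked\n";
```

The split stream yields one separator too many, because the "\r" at the end of the first chunk is accepted before the following "\n" has been seen. A correct reader would have to hold back a trailing "\r" until more input arrives.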

Ben

PS Random note. A random idea a co-worker and I have been throwing around
(based on my massively speeding up a program by doing this) is "lazy
concatenation" of strings. If someone is building up a string through
interpolation and concatenation, it makes sense to internally use
something closer to an array of strings, and then join that into one string
if you ever need to. (print() has no need to join them, the RE engine
does.) This should transparently accelerate a lot of current Perl code...
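
At the user level the idea looks roughly like the sketch below: collect the pieces in an array and join them only when a single string is needed. This is an illustration of the pattern, not a claim about how the interpreter works or a measured speed-up.

```perl
use strict;
use warnings;

# Collect fragments without building the final string yet.
my @pieces;
for my $i (1 .. 3) {
    push @pieces, "row $i\n";
}

# print() accepts a list, so no join is needed for output:
#   print @pieces;

# Join only when one string is really required, e.g. for a regex match.
my $joined = join '', @pieces;
my $rows   = () = $joined =~ /^row /mg;
```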

p5pRT commented Nov 22, 1999

From [Unknown Contact. See original ticket]

>The suggestion that I heard which I most like is letting $/ be an RE. So
>you can make $/ be /\n|\r\n?/ and it magically "does the right thing" on
>virtually any file. However integrating this logic into the RE engine
>could be interesting. After all if you apply the pattern I gave to a
>string that ends with \r, it will match even if the next character to be
>read is \n. This is manifestly not the right thing to do. :-(

Anything that deviates from the notion of internally representing the
line terminator as a single character (the virtual "\n") is a grave error.

--tom

p5pRT commented Nov 22, 1999

From @pudge

At 07.54 -0700 1999.11.22, Tom Christiansen wrote:

>>The suggestion that I heard which I most like is letting $/ be an RE. So
>>you can make $/ be /\n|\r\n?/ and it magically "does the right thing" on
>>virtually any file. However integrating this logic into the RE engine
>>could be interesting. After all if you apply the pattern I gave to a
>>string that ends with \r, it will match even if the next character to be
>>read is \n. This is manifestly not the right thing to do. :-(
>
>Anything that deviates from the notion of internally representing the
>line terminator as a single character (the virtual "\n") is a grave error.

I had another idea ... which may be completely useless, but it is
interesting to think about. If $/ is a regex, then it is only a regex
until it matches the first time, at which point $/ becomes equal to $1. So​:

  $/ = qr/(\015?\012|\015)/;

As soon as it sees \015\012, \012, or \015, $/ becomes whatever it matched.
Again, this would require per-filehandle IRS to be useful. Not advocating,
just throwing it out for fun. :D

--
Chris Nandor mailto:pudge@pobox.com http://pudge.net/
%PGPKey = ('B76E72AD', [1024, '0824090B CE73CA10 1FF77F13 8180B6B6'])

p5pRT commented Nov 22, 1999

From [Unknown Contact. See original ticket]

John Macdonald <jmm@elegant.com> writes:

>Nick Ing-Simmons wrote :
>|| John Macdonald <jmm@elegant.com> writes:
>|| >||
>|| >|| That is far from daft. sv_gets() (the internals of readline) would know
>|| >|| what it had used to find the end of the line. It could leave the
>|| >|| information around for chomp to use.
>|| >
>|| >But as soon as you have a program that opens multiple files of
>|| >differing formats, this breaks down. You end up with taint-like
>|| >tracing of strings to track which form of file each string came
>|| >from.
>||
>|| Yes, the EOLN string would have to be annotated on the SV somewhere
>|| presumably as "magic". Having chomp look for "EOLN magic" on the SV
>|| would be easy to do. The 'set' part of the magic would clear the field.
>
>It gets messier...
>
>  $para = "$file1_lines$file2_lines$file3_lines";
>
>Which of the three EOLN magics gets assigned to $para?

In my purely fictional implementation none would.

--
Nick Ing-Simmons <nik@tiuk.ti.com>
Via, but not speaking for: Texas Instruments Ltd.

p5pRT commented Nov 22, 1999

From [Unknown Contact. See original ticket]

Nick Ing-Simmons wrote :
|| John Macdonald <jmm@elegant.com> writes:
|| >Nick Ing-Simmons wrote :
|| >|| John Macdonald <jmm@elegant.com> writes:
|| >|| >||
|| >|| >|| That is far from daft. sv_gets() (the internals of readline) would know
|| >|| >|| what it had used to find the end of the line. It could leave the
|| >|| >|| information around for chomp to use.
|| >|| >
|| >|| >But as soon as you have a program that opens multiple files of
|| >|| >differing formats, this breaks down. You end up with taint-like
|| >|| >tracing of strings to track which form of file each string came
|| >|| >from.
|| >||
|| >|| Yes, the EOLN string would have to be annotated on the SV somewhere
|| >|| presumably as "magic". Having chomp look for "EOLN magic" on the SV
|| >|| would be easy to do. The 'set' part of the magic would clear the field.
|| >
|| >It gets messier...
|| >
|| > $para = "$file1_lines$file2_lines$file3_lines";
|| >
|| >Which of the three EOLN magics gets assigned to $para?
||
|| In my purely fictional implementation none would.

so then:

  chomp $file3_lines;
  chomp $para;

could remove different values from two strings with the same
termination value originating from the same source file line, which
would violate the principle of least astonishment for some users.

Good thing this whole issue is fictional. :-)

--
objects: | John Macdonald
  Think of them as data with an attitude. | jmm@elegant.com

p5pRT commented Nov 22, 1999

From @samtregar

On Mon, 22 Nov 1999, Mark-Jason Dominus wrote:

>2. Sam Tregar said he would do it, but I don't know if he will.

Unfortunately Sam Tregar is just a novice perl hacker! I've been poking
around a bit but I'm not convinced I've even found the right place to
start working yet.

If this is a high-priority item, perhaps someone more experienced should
consider giving it a try.

-sam

p5pRT commented Jul 13, 2005

From @schwern

[Ben_Tilly@trepp.com - Thu Nov 18 23:18:09 1999]:

>Is there any possibility of having chomp() be modified to recognize \n, \r,
>and \r\n as line-endings to chomp?

Do you mean that chomp(), rather than being equivalent to:

  s{$/\z}{};

should be:

  s{(\r|\n|\r\n)\z}{};

?
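
The two behaviours can be compared side by side as ordinary functions. This is only a sketch: `chomp_like` and `chomp_any` are hypothetical names, and `chomp_like` covers ordinary string values of `$/` only (not paragraph mode or an undefined `$/`).

```perl
use strict;
use warnings;

# Approximates chomp() for the current $/; \Q...\E keeps the
# separator literal in case it contains regex metacharacters.
sub chomp_like {
    my ($s) = @_;
    my $sep = $/;
    $s =~ s{\Q$sep\E\z}{};
    return $s;
}

# The requested behaviour: strip one trailing CRLF, CR, or LF.
sub chomp_any {
    my ($s) = @_;
    $s =~ s{(?:\r\n|\r|\n)\z}{};
    return $s;
}

my $crlf_default = chomp_like("text\r\n");   # the "\r" is left behind
my $crlf_any     = chomp_any("text\r\n");    # both bytes are removed
```

Both functions return a modified copy instead of changing their argument in place, which is a deliberate simplification compared with the built-in.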

p5pRT commented Jul 13, 2005

The RT System itself - Status changed from 'stalled' to 'open'

p5pRT commented Oct 2, 2005

From schubiger@cpan.org

[Ben_Tilly@trepp.com - Thu Nov 18 23:18:09 1999]:

>Is there any possibility of having chomp() be modified to recognize \n, \r,
>and \r\n as line-endings to chomp? A source of nasty confusion for people
>working in a cross-platform environment is when the same exact script gives
>very different results on the same exact file depending on whether you are
>running under *nix or Windows. (Particularly an issue with Samba because
>people wind up reading under one system files created under the other.)

The record separator that is defaulted to '\n' on Unix, will change
depending on the operating system Perl is running on - chomp() relies
heavily upon the value of $/. Chomping native files shouldn't cause much
noise, whereas chomping 'foreign' files with differing line endings
would require that you localize $/ in the scope of operation.

Example:

  {
    local $/ = "\r\n";
    $chomped = chomp(@lines);
  }

>Yes, the current behaviour works as documented. But it leads to code not
>doing what people expect, and a confused person can easily spend several
>hours confused...

I'd say, it's rather clearly documented, without lack of accurate
description. Although the behaviour requested is desirable, it doesn't
seem possible to integrate the inevitable changes to doop.c:Perl_do_chomp,
where the record separator, known as global PL_rs, is extensively
utilized and relied upon - allowing for multiple values would require
tremendous changes and furthermore, would likely break backwards
compatibility.

p5pRT commented Nov 20, 2006

From @rgs

Rejected, mostly for backwards compatibility reasons.

p5pRT commented Nov 20, 2006

@rgs - Status changed from 'open' to 'rejected'

p5pRT closed this as completed Nov 20, 2006