BOM flag should be removed if open file with UTF-8 #3725

Closed
p6rt opened this issue Mar 9, 2015 · 6 comments

p6rt commented Mar 9, 2015

Migrated from rt.perl.org#124024 (status was 'resolved')

Searchable as RT124024$

p6rt commented Oct 27, 2014

From @FROGGS

Taken from: http://irclog.perlgeek.de/perl6/2014-10-27#i_9574112

bronco_creek: I have a question about reading from a file. When I use, for example: for $in.lines -> $line {say $line} I see a "?" prepended to the first line. What's up with that?
FROGGS: that sounds like a replacement character..
geekosaur: windows by any chance? a utf8 file with a byte order mark at the start (like windows text editors like to prepend), sent to a non-utf8 display, would do that
FROGGS: bronco_creek: does that file start with a BOM?
bronco_creek: Yes, Windows.
FROGGS: bronco_creek: you've got notepad++ by any chance?
bronco_creek: I'm using Padre.
FROGGS: ahh, hmmm
FROGGS: dunno if that allows you to save that file without a BOM
bronco_creek: The file I'm trying to process is xml output from another program, but the leading "?" does not show up in Padre.
timotimo: do you have some kind of hex editor?
timotimo: now the question becomes
timotimo: do we want to support byte order marks in rakudo?
geekosaur: I think you'll need to; notepad in at least some versions of windows will insist on prepending a BOM
geekosaur: some other text editors as well (and there's a handful of programs, including some from Microsoft, that depend on it!)
geekosaur: even though you're not supposed to use a BOM with UTF-8
bronco_creek: I tried to strip the "?" with ~~ s/^\?// but that did not do the trick.
geekosaur: yes, it's getting mapped to a replacement character on output; it's not actually a question mark
timotimo: that's right
timotimo: how about opening the file as binary and outputting the first few bytes as hexadecimals and looking at what exactly it is
geekosaur: actually on display, not on output
bronco_creek: Hmm. Drove me nuts trying to figure out why all of my xml files were invalid when I tried to parse with XML or XML::Parser::Tiny
timotimo: damn, that's annoying
timotimo: very sorry to hear that, bronco_creek :(
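
The diagnostic timotimo suggests (open the file as binary and look at the first few bytes in hex) can be sketched as follows. This is a minimal illustration in Python, not Rakudo code; the helper names `describe_leading_bytes` and `has_utf8_bom` and the sample XML bytes are made up for the example. It also shows why a substitution that matches a literal "?" finds nothing: the decoded text starts with U+FEFF, and the "?" appears only when a terminal cannot display that character.

```python
import codecs

def describe_leading_bytes(data: bytes, count: int = 4) -> str:
    """Return the first `count` bytes as space-separated hex digits."""
    return data[:count].hex(" ")

def has_utf8_bom(data: bytes) -> bool:
    """True if the data begins with the UTF-8 encoded BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)

# Hypothetical file contents: a BOM followed by an XML declaration.
sample = codecs.BOM_UTF8 + b"<?xml version='1.0'?>"

print(describe_leading_bytes(sample))   # ef bb bf 3c
print(has_utf8_bom(sample))             # True

# A plain UTF-8 decode keeps the BOM as the character U+FEFF.
text = sample.decode("utf-8")
print(text.startswith("?"))             # False: there is no literal question mark
print(text.startswith("\ufeff"))        # True: the first character is U+FEFF
```

For a real file, the same helpers would be applied to bytes read from a handle opened in binary mode.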

p6rt commented Mar 9, 2015

From wbiker@gmx.at

Hi,

On Windows, Notepad.exe saves a Byte Order Mark (BOM) at the beginning of
the text. Perl 6 does not filter out this BOM when the file is opened as UTF-8
without :bin set.

The user has to handle that, which is annoying and not intuitive.

thanks

p6rt commented Mar 9, 2015

From @smls

For the record, this was recently discussed on IRC starting at <http://irclog.perlgeek.de/perl6/2015-03-06#i_10234453>. Excerpt of constructive comments:

moritz: open() could have a :strip-bom option or so
  you could pass :strip-bom even on linux

perl6_newbee: I would like it the other way: have the BOM stripped unless the :dont-strip-bom attribute is set

PerlJam: what if it's *not* a BOM, but looks like one?

moritz: .u FEFF
yoleaux: U+FEFF ZERO WIDTH NO-BREAK SPACE [Cf] (<control>)
moritz: well, the BOM is also a valid zero-width non-breaking space

jnthn: If we implement this, then it belongs, imo, in the UTF-8 decoding handling
  I don't think there's a problem with most sane Windows programs if you don't write out a BOM.
  But on the whole I suspect not tolerating one when reading is going to just create a lot of questions
  While tolerating it is unlikely to burn anybody if it's implemented in the correct place.

TimToady: +1 to always stripping BOM on textual input
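
The approach favoured above (strip the BOM in the UTF-8 decoding step, and only at the very start of the input) can be sketched roughly like this. The sketch is in Python for illustration and is not how MoarVM implements it; `decode_utf8_strip_bom` is a hypothetical helper. It touches on PerlJam's concern in one respect: a U+FEFF that appears anywhere other than the first position is left alone, so only a leading one is treated as a BOM.

```python
import codecs

def decode_utf8_strip_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a single leading BOM if present.

    Hypothetical helper for illustration. Only the first three bytes are
    examined; any later U+FEFF is kept as an ordinary character.
    """
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")
interior = "a\ufeffb".encode("utf-8")

print(repr(decode_utf8_strip_bom(with_bom)))   # 'hello'
print(repr(decode_utf8_strip_bom(interior)))   # 'a\ufeffb'
```

Python's built-in `utf-8-sig` codec behaves the same way when decoding, which is one precedent for putting this logic in the decoder rather than behind an open() option.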

p6rt commented Mar 9, 2015

The RT System itself - Status changed from 'new' to 'open'

p6rt commented Oct 6, 2015

From @jnthn

Now implemented in Moar's UTF-8 decoder, and covered by S16-io/bom.t in the spectests.
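
The behaviour described here (textual reads no longer expose the BOM, binary reads still do) could be checked along the following lines. This is a hedged sketch in Python that uses the `utf-8-sig` codec as a stand-in for a BOM-tolerant decoder; it does not reproduce the contents of S16-io/bom.t, and the file name and helper functions are invented for the example.

```python
import codecs
import os
import tempfile

def write_bom_file(path: str, text: str) -> None:
    """Write `text` as UTF-8 with a leading BOM, as Notepad would."""
    with open(path, "wb") as fh:
        fh.write(codecs.BOM_UTF8 + text.encode("utf-8"))

def read_lines_text(path: str) -> list:
    """Read the file as text through a decoder that tolerates a BOM."""
    with open(path, "r", encoding="utf-8-sig") as fh:
        return fh.read().splitlines()

def read_raw(path: str) -> bytes:
    """Read the file in binary mode; nothing is stripped."""
    with open(path, "rb") as fh:
        return fh.read()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bom.txt")   # hypothetical file name
    write_bom_file(path, "first\nsecond\n")
    lines = read_lines_text(path)
    raw = read_raw(path)

print(lines)                            # ['first', 'second']
print(raw.startswith(codecs.BOM_UTF8))  # True
```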

@p6rt p6rt closed this as completed Oct 6, 2015
p6rt commented Oct 6, 2015

@jnthn - Status changed from 'open' to 'resolved'

@p6rt p6rt added the suggestion label Jan 5, 2020