BOM flag should be removed if open file with UTF-8 #3725

p6rt · 2015-03-09T07:22:52Z

Migrated from rt.perl.org#124024 (status was 'resolved')

Searchable as RT124024$

p6rt · 2014-10-27T23:03:39Z

From @FROGGS

Taken from: http://irclog.perlgeek.de/perl6/2014-10-27#i_9574112

bronco_creek: I have a question about reading from a file. When I use, for example: for $in.lines -> $line {say $line} I see a "?" prepended to the first line. What's up with that?
FROGGS: that sounds like a replacement character..
geekosaur: windows by any chance? a utf8 file with a byte order mark at the start (like windows text editors like to prepend), sent to a non-utf8 display, would do that
FROGGS: bronco_creek: does that file start with a BOM?
bronco_creek: Yes, Windows.
FROGGS: bronco_creek: you've got notepad++ by any chance?
bronco_creek: I'm using Padre.
FROGGS: ahh, hmmm
FROGGS: dunno if that allows you to save that file without a BOM
bronco_creek: The file I'm trying to process is xml output from another program, but the leading "?" does not show up in Padre.
timotimo: do you have some kind of hex editor?
timotimo: now the question becomes
timotimo: do we want to support byte order marks in rakudo?
geekosaur: I think you'll need to; notepad in at least some versions of windows will insist on prepending a BOM
geekosaur: some other text editors as well (and there's a handful of programs, including some from Microsoft, that depend on it!)
geekosaur: even though you're not supposed to use a BOM with UTF-8
bronco_creek: I tried to strip the "?" with ~~ s/^\?// but that did not do the trick.
geekosaur: yes, it's getting mapped to a replacement character on output; it's not actually a question mark
timotimo: that's right
timotimo: how about opening the file as binary and outputting the first few bytes as hexadecimals and looking at what exactly it is
geekosaur: actually on display, not on output
bronco_creek: Hmm. Drove me nuts trying to figure out why all of my xml files were invalid when I tried to parse with XML or XML::Parser::Tiny
timotimo: damn, that's annoying
timotimo: very sorry to hear that, bronco_creek :(
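
The diagnostic timotimo suggests above (open the file as binary and print the first few bytes as hexadecimal) can be sketched roughly as follows. This is an illustration in Python rather than Perl 6, and the helper names (`leading_bytes_hex`, `starts_with_utf8_bom`, `write_sample`) and the sample file are hypothetical. A UTF-8 BOM appears as the three bytes EF BB BF, which is also why a substitution such as s/^\?// cannot match: the first character is U+FEFF, not a literal question mark.

```python
import os
import tempfile

UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def leading_bytes_hex(path, count=4):
    """Return the first `count` bytes of a file as space-separated hex."""
    with open(path, "rb") as fh:  # binary mode: no decoding is applied
        head = fh.read(count)
    return " ".join(f"{b:02X}" for b in head)


def starts_with_utf8_bom(path):
    """Report whether the file begins with the UTF-8 encoded BOM."""
    with open(path, "rb") as fh:
        return fh.read(len(UTF8_BOM)) == UTF8_BOM


def write_sample(data):
    """Write `data` to a temporary file (hypothetical test input)."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
    return path


if __name__ == "__main__":
    sample = write_sample(UTF8_BOM + b"<?xml version='1.0'?>")
    print(leading_bytes_hex(sample))      # EF BB BF 3C
    print(starts_with_utf8_bom(sample))   # True
    os.remove(sample)
```

If the output starts with EF BB BF, the file carries a BOM and the stray first character comes from it.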

p6rt · 2015-03-09T07:22:52Z

From wbiker@gmx.at

Hi,

On Windows, Notepad.exe saves a Byte Order Mark (BOM) at the beginning of
the text. Perl 6 does not filter out this BOM when the file is opened as
UTF-8 without :bin set.

The user has to handle that, which is annoying and not intuitive.

thanks

p6rt · 2015-03-09T13:03:21Z

From @smls

For the record this was recently discussed on IRC starting at <http://irclog.perlgeek.de/perl6/2015-03-06#i_10234453>. Excerpt of constructive comments:

moritz: open() could have a :strip-bom option or so
you could pass :strip-bom even on linux

perl6_newbee: I would like it the other way. To have the BOM stripped unless the :dont-strip-bom attribute is set

PerlJam: what if it's *not* a BOM, but looks like one?

moritz: .u FEFF
yoleaux: U+FEFF ZERO WIDTH NO-BREAK SPACE [Cf] (<control>)
moritz: well, the BOM is also a valid zero-width non-breaking space

jnthn: If we implement this, then it belongs, imo, in the UTF-8 decoding handling
I don't think there's a problem with most sane Windows programs if you don't write out a BOM.
But on the whole I suspect not tolerating one when reading is going to just create a lot of questions
While tolerating it is unlikely to burn anybody if it's implemented in the correct place.

TimToady: +1 to always stripping BOM on textual input
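
As a rough illustration of the approach favoured above (handle the BOM inside UTF-8 decoding, and only at the very start of the input), here is a minimal sketch in Python. It is not the Rakudo or MoarVM code; the function name `decode_utf8_stripping_bom` is hypothetical. It drops a single leading U+FEFF and leaves any later U+FEFF untouched, so a zero-width no-break space elsewhere in the text survives.

```python
BOM = "\ufeff"  # U+FEFF: BOM at the start, zero-width no-break space elsewhere


def decode_utf8_stripping_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping one U+FEFF if it is the first character."""
    text = data.decode("utf-8")
    if text.startswith(BOM):
        return text[1:]  # strip only the first occurrence
    return text


if __name__ == "__main__":
    with_bom = b"\xef\xbb\xbf<xml/>"
    print(decode_utf8_stripping_bom(with_bom))  # <xml/>

    # A U+FEFF in the middle of the text is left alone.
    inner = "a\ufeffb".encode("utf-8")
    print(len(decode_utf8_stripping_bom(inner)))  # 3

    # Python's built-in "utf-8-sig" codec behaves the same way on input.
    print(with_bom.decode("utf-8-sig") == decode_utf8_stripping_bom(with_bom))  # True
```

Because the check is limited to the first character of the stream, text that merely contains U+FEFF later on is not altered, which limits the risk PerlJam and moritz raise.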

p6rt · 2015-03-09T13:03:21Z

The RT System itself - Status changed from 'new' to 'open'

p6rt · 2015-10-06T11:05:57Z

From @jnthn

Now implemented in Moar's UTF-8 decoder, and covered by S16-io/bom.t in the spectests.

p6rt · 2015-10-06T11:05:59Z

@jnthn - Status changed from 'open' to 'resolved'

p6rt closed this as completed Oct 6, 2015

p6rt added the suggestion label Jan 5, 2020
