seeking on bytes causes broken perl strings #11697

Open
p5pRT opened this issue Oct 14, 2011 · 15 comments
p5pRT commented Oct 14, 2011

Migrated from rt.perl.org#101382 (status was 'open')

Searchable as RT101382$

p5pRT commented Oct 14, 2011

From tchrist@perl.com

Perl's seek and sysseek take off_t arguments and retvals, which are in
bytes. But you can call them on a stream with an encoding. Now you're
doomed, because all read-like functions in perl (getc, readline, read,
sysread) go through the encoding layer. That means you can seek to partway
through a multibyte UTF-8 or UTF-16 character (for example), and when you
next read something, you have just produced a broken UTF-8 string in Perl.
I'm pretty sure that that is supposed to be Against The Rules.

  % perl -CS -E 'say "E\x{301}" x 50 for 1..100' > sample.utf8

  % cat sysseek-enc-test
  #!/usr/bin/env perl
  # sysseek-enc-test
  use v5.14;
  use strict;
  use warnings;
  use open qw(:std :utf8);
  use Fcntl qw(:seek);
  use Devel::Peek;
  my $encoding = "utf8"; # same results w/ "encoding(UTF-8)"
  my $mode = "< :$encoding";
  @ARGV == 3 || die "usage: $0 utf8filename offset count";
  my($filename, $offset, $count) = @ARGV;
  $offset =~ /^\d+$/aa || die "offset should be whole number";
  $count =~ /^\d+$/aa || die "count should be whole number";
  open(my $fh, $mode, $filename) || die "$0: can't open $mode $filename: $!\n";
  my $newpos = sysseek($fh, $offset, SEEK_SET) // die "$0: sysseek failed: $!\n";
  my $sysret = sysread($fh, my $buf, $count);
  $sysret == $count || die "$0: only sysread $sysret not $count chars: $!";
  print "sysread worked, trying print and dump...\n";
  printf "%d chars from offset %d are length %d:",
      $count, $offset, length($buf);
  printf "<%s>, U+%v04X\n", $buf, $buf;
  Dump($buf);

  % perl sysseek-enc sample.utf8 2 4
  sysread worked, trying print and dump...
  Malformed UTF-8 character (unexpected continuation byte 0x81, with no preceding start byte) in printf at sysseek-enc line 22.
  4 chars from offset 2 are length 4:<#ÉE>, U+0000.0045.0301.0045
  SV = PVMG(0x3c036640) at 0x3c0657b4
  REFCNT = 1
  FLAGS = (PADMY,SMG,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x3c026548 "\201E\314\201E"\0Malformed UTF-8 character (unexpected continuation byte 0x81, with no preceding start byte) in subroutine entry at sysseek-enc line 23.
  [UTF8 "\x{0}E\x{301}E"]
  CUR = 5
  LEN = 12
  MAGIC = 0x3c049be0
  MG_VIRTUAL = &PL_vtbl_utf8
  MG_TYPE = PERL_MAGIC_utf8(w)
  MG_LEN = 4

Notice that the data produced claim to have an initial code point of
U+0000. But that isn't so:

  % head -1 sample.utf8 | uniquote -x | chop -72
  E\x{301}E\x{301}E\x{301}E\x{301}E\x{301}E\x{301}E\x{301}E\x{301}E\x{301
  % head -1 sample.utf8 | uniquote -b | chop -72
  E\xCC\x81E\xCC\x81E\xCC\x81E\xCC\x81E\xCC\x81E\xCC\x81E\xCC\x81E\xCC\x8
  % head -1 sample.utf8 | uniquote -o | chop -72
  E\314\201E\314\201E\314\201E\314\201E\314\201E\314\201E\314\201E\314\20
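The dumps above show each "E\x{301}" occupying three bytes (0x45, 0xCC, 0x81), so byte offset 2 is the trailing continuation byte of the combining accent. Here is a minimal sketch, not part of the original report, that classifies each byte by its top two bits; the helper name `is_continuation` is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: a UTF-8 continuation byte has the bit pattern 10xxxxxx.
sub is_continuation {
    my ($byte) = @_;
    return (ord($byte) & 0xC0) == 0x80;
}

# Two repetitions of the sample text: E + COMBINING ACUTE ACCENT.
my $bytes = encode("UTF-8", "E\x{301}E\x{301}");

for my $offset (0 .. length($bytes) - 1) {
    my $byte = substr($bytes, $offset, 1);
    printf "offset %d: 0x%02X %s\n", $offset, ord($byte),
        is_continuation($byte) ? "continuation byte (unsafe seek target)"
                               : "character start";
}
```

Only offsets where the byte is not a continuation byte are safe targets for a seek that will be followed by a decoding read.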

The problem doesn't change if you :%s/sys//g the previous program:

  % cat seek-enc-test
  #!/usr/bin/env perl
  # seek-enc-test
  use v5.14;
  use strict;
  use warnings;
  use open qw(:std :utf8);
  use Fcntl qw(:seek);
  use Devel::Peek;
  my $encoding = "utf8"; # same results w/ "encoding(UTF-8)"
  my $mode = "< :$encoding";
  @ARGV == 3 || die "usage: $0 utf8filename offset count";
  my($filename, $offset, $count) = @ARGV;
  $offset =~ /^\d+$/aa || die "offset should be whole number";
  $count =~ /^\d+$/aa || die "count should be whole number";
  open(my $fh, $mode, $filename) || die "$0: can't open $mode $filename: $!\n";
  my $newpos = seek($fh, $offset, SEEK_SET) // die "$0: seek failed: $!\n";
  my $ret = read($fh, my $buf, $count);
  $ret == $count || die "$0: only read $ret not $count chars: $!";
  print "read worked, trying print and dump...\n";
  printf "%d chars from offset %d are length %d:",
      $count, $offset, length($buf);
  printf "<%s>, U+%v04X\n", $buf, $buf;
  Dump($buf);

  % perl seek-enc sample.utf8 2 4
  read worked, trying print and dump...
  Malformed UTF-8 character (unexpected continuation byte 0x81, with no preceding start byte) in printf at seek-enc line 22.
  4 chars from offset 2 are length 4:<#ÉE>, U+0000.0045.0301.0045
  SV = PVMG(0x3c036640) at 0x3c065fb4
  REFCNT = 1
  FLAGS = (PADMY,SMG,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x3c046880 "\201E\314\201E"\0Malformed UTF-8 character (unexpected continuation byte 0x81, with no preceding start byte) in subroutine entry at seek-enc line 23.
  [UTF8 "\x{0}E\x{301}E"]
  CUR = 5
  LEN = 12
  MAGIC = 0x3c0493e0
  MG_VIRTUAL = &PL_vtbl_utf8
  MG_TYPE = PERL_MAGIC_utf8(w)
  MG_LEN = 4

Ok, now what?

--tom

Summary of my perl5 (revision 5 version 14 subversion 0) configuration:

  Platform:
  osname=openbsd, osvers=4.4, archname=OpenBSD.i386-openbsd
  uname='openbsd chthon 4.4 generic#0 i386 '
  config_args='-des'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=undef, usemultiplicity=undef
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=undef, use64bitall=undef, uselongdouble=undef
  usemymalloc=y, bincompat5005=undef
  Compiler:
  cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
  optimize='-O2',
  cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='3.3.5 (propolice)', gccosandvers='openbsd4.4'
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries:
  ld='cc', ldflags ='-Wl,-E -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /usr/lib
  libs=-lgdbm -lm -lutil -lc
  perllibs=-lm -lutil -lc
  libc=/usr/lib/libc.so.48.0, so=so, useshrplib=false, libperl=libperl.a
  gnulibc_version=''
  Dynamic Linking:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
  cccdlflags='-DPIC -fPIC ', lddlflags='-shared -fPIC -L/usr/local/lib -fstack-protector'

Characteristics of this binary (from libperl):
  Compile-time options: MYMALLOC PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP
  PERL_PRESERVE_IVUV USE_LARGE_FILES USE_PERLIO
  USE_PERL_ATOF
  Built under openbsd
  Compiled at Jun 11 2011 11:48:28
  %ENV:
  PERL_UNICODE="SA"
  @INC:
  /usr/local/lib/perl5/site_perl/5.14.0/OpenBSD.i386-openbsd
  /usr/local/lib/perl5/site_perl/5.14.0
  /usr/local/lib/perl5/5.14.0/OpenBSD.i386-openbsd
  /usr/local/lib/perl5/5.14.0
  /usr/local/lib/perl5/site_perl/5.12.3
  /usr/local/lib/perl5/site_perl/5.11.3
  /usr/local/lib/perl5/site_perl/5.10.1
  /usr/local/lib/perl5/site_perl/5.10.0
  /usr/local/lib/perl5/site_perl/5.8.7
  /usr/local/lib/perl5/site_perl/5.8.0
  /usr/local/lib/perl5/site_perl/5.6.0
  /usr/local/lib/perl5/site_perl/5.005
  /usr/local/lib/perl5/site_perl
  .

p5pRT commented Oct 14, 2011

From @doy

On Fri, Oct 14, 2011 at 03:56:49PM -0700, tchrist1 wrote:

Perl's seek and sysseek take off_t arguments and retvals, which are in
bytes. But you can call them on a stream with an encoding. Now you're
doomed, because all read-like functions in perl (getc, readline, read,
sysread) go through the encoding layer. That means you can seek to partway
through a multibyte UTF-8 or UTF-16 character (for example), and when you
next read something, you have just produced a broken UTF-8 string in Perl.
I'm pretty sure that that is supposed to be Against The Rules.

For what it's worth (not to say that I agree with the current behavior),
this is documented in perldoc -f seek:

  Note the in bytes: even if the filehandle has been set to operate on
  characters (for example by using the ":encoding(utf8)" open layer),
  tell() will return byte offsets, not character offsets (because
  implementing that would render seek() and tell() rather slow).

I guess the idea is that in order to seek to some location in the middle
of the file, you would have to parse and decode the entire file from the
beginning up until the point you're seeking to on every call to seek,
which could potentially be unusably slow depending on the size of the
file. I don't know how hard getting around this would be.
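Because tell() hands back byte offsets, an offset recorded right after a character-oriented read sits on a character boundary and is safe to seek back to later. A small sketch of that pattern follows; the in-memory file stands in for sample.utf8 and is an assumption made only to keep the example self-contained.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);
use Encode qw(encode);

# In-memory stand-in for sample.utf8 (an assumption for this sketch).
my $bytes = encode("UTF-8", "E\x{301}" x 4);
open(my $fh, "<:utf8", \$bytes) or die "can't open in-memory file: $!";

# Read two characters (E plus its combining accent), then record the offset.
read($fh, my $first, 2);
my $bookmark = tell($fh);        # a byte offset, here on a character boundary

read($fh, my $second, 2);        # read on past the bookmark

# Seeking back to an offset that tell() handed out lands on a boundary again.
seek($fh, $bookmark, SEEK_SET) or die "seek failed: $!";
read($fh, my $again, 2);

printf "bookmark=%d, reread %s\n", $bookmark,
    $again eq $second ? "matches" : "differs";
```

The trouble described in this ticket only starts when the offset comes from somewhere other than tell(), such as arithmetic on character counts.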

-doy

p5pRT commented Oct 14, 2011

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Oct 14, 2011

From tchrist@perl.com

For what it's worth (not to say that I agree with the current behavior),
this is documented in perldoc -f seek:

Yes, I know. It was sysseek and sysread where I got into the problem.
Of course it makes sense, but sysread makes too many mumbles about bytes.
It should talk about characters, since that's what it reads unless it's
opened in binary mode. This is a holdover from the confusion about
:raw vs :crlf vs :utf8 vs :bytes, and which of those are actually opposed
to each other.

In the meanwhile, I suggest that this at the least be added to sysseek:

  B<WARNING>: I<POSITION> is in bytes not characters, no matter whether there
  should happen to be an encoding layer on the filehandle or not. However,
  all functions in Perl that read or write handles I<do> go through any encoding
  layer, and you can therefore read a partial "character" and wind up with
  an invalid Perl string. Avoid mixing calls to C<sysseek> or C<seek> with
  I/O functions on filehandles with multibyte encoding layers.
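One way to follow that advice is to keep the handle in :raw mode, so that seek offsets and read lengths are both in bytes, and to decode explicitly afterwards. This is a sketch under that assumption and not something the proposed wording prescribes; the in-memory file stands in for sample.utf8.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);
use Encode qw(decode encode);

# In-memory stand-in for sample.utf8 (assumed so the sketch runs on its own).
my $file = encode("UTF-8", "E\x{301}" x 4);

# No encoding layer: offsets and lengths are both counted in bytes.
open(my $fh, "<:raw", \$file) or die "can't open in-memory file: $!";
seek($fh, 2, SEEK_SET) or die "seek failed: $!";
read($fh, my $bytes, 5) == 5 or die "short read";

# Decode explicitly. With no CHECK argument, Encode substitutes U+FFFD for
# the stray continuation byte, so the result is a well-formed Perl string.
my $text = decode("UTF-8", $bytes);
printf "U+%v04X\n", $text;
```

The bad landing spot is still visible, as a replacement character, but the string itself is valid.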

--tom

p5pRT commented Oct 17, 2011

From @deven

On Fri, Oct 14, 2011 at 7:19 PM, Tom Christiansen <tchrist@perl.com> wrote:

In the meanwhile, I suggest that this at the least be added to sysseek:

B<WARNING>: I<POSITION> is in bytes not characters, no matter whether there
should happen to be an encoding layer on the filehandle or not. However,
all functions in Perl that read or write handles I<do> go through any encoding
layer, and you can therefore read a partial "character" and wind up with
an invalid Perl string. Avoid mixing calls to C<sysseek> or C<seek> with
I/O functions on filehandles with multibyte encoding layers.

Perhaps it would be better for the warning to talk about the dangers of
seeking to a byte offset which is not at a character boundary. After all,
if someone seeks an odd number of bytes into UTF-16 data, it will corrupt
the data from that point forward, right? The corruption from a bad seek is
not necessarily limited to the first character.

I would propose adding special-case logic for UTF-8 for the following
reasons:

* UTF-8 has variable-length character encodings, so it's impossible to
predict a character boundary without knowing what characters came before.
* UTF-8 is the preferred encoding and increasingly common.
* Seeking to the middle of a multibyte character encoding shouldn't happen
when using a previous tell location, but can quite easily occur otherwise.
* Returning a malformed string as a result of reading a correctly-encoded
UTF-8 data stream is highly undesirable.
* UTF-8 is self-synchronizing, and continuation bytes are unambiguous.

My proposed solution is simple: On the first character-oriented read after a
byte-oriented seek into UTF-8 data, silently ignore up to 5 continuation
bytes instead of returning a malformed string. This wouldn't have the
performance implications of seeking by character offsets, but would easily
allow seeking into the middle of a large amount of UTF-8 data without
generating spurious phantom character encoding errors from seeking to the
wrong byte. A warning about the ignored bytes could be useful, but probably
rarely, so it should probably be off unless explicitly requested.

Wouldn't this be better than creating corrupted characters in strings that
didn't actually exist in the source data?
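The skipping step in this proposal can be tried out in user space on raw bytes. The sketch below is only an illustration of the idea and not how PerlIO would implement it; `skip_leading_continuation` is a hypothetical helper, and the limit of 5 is taken from the proposal above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper, not part of Perl: drop up to $max leading UTF-8
# continuation bytes (10xxxxxx) from a byte string and report how many went.
sub skip_leading_continuation {
    my ($bytes, $max) = @_;
    my $skipped = 0;
    while ($skipped < $max
        && $skipped < length($bytes)
        && (ord(substr($bytes, $skipped, 1)) & 0xC0) == 0x80) {
        $skipped++;
    }
    return (substr($bytes, $skipped), $skipped);
}

# Bytes as they would arrive after seeking to offset 2 of "E\x{301}E\x{301}...".
my $raw = substr(encode("UTF-8", "E\x{301}" x 3), 2);
my ($clean, $skipped) = skip_leading_continuation($raw, 5);
my $text = decode("UTF-8", $clean);
printf "skipped %d byte(s), decoded %d character(s)\n", $skipped, length($text);
```

Returning the skip count keeps the loss observable to the caller, which a fully silent version would not.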

Deven

p5pRT commented Oct 17, 2011

From @ap

* Deven T. Corzine <deven@ties.org> [2011-10-17 16:35]:

My proposed solution is simple: On the first character-oriented read
after a byte-oriented seek into UTF-8 data, silently ignore up to
5 continuation bytes instead of returning a malformed string. This
wouldn't have the performance implications of seeking by character
offsets, but would easily allow seeking into the middle of a large
amount of UTF-8 data without generating spurious phantom character
encoding errors from seeking to the wrong byte. A warning about the
ignored bytes could be useful, but probably rarely, so it should
probably be off unless explicitly requested.

Wouldn't this be better than creating corrupted characters in strings
that didn't actually exist in the source data?

That will make the read “work” in that it won’t complain and won’t
produce corrupt data, but in what practical scenario is this useful?

“Sometimes your read will swallow the first character in the data. We
won’t tell you though. In fact we’ll make sure you can’t notice even
if you wanted to.”

What is the sense of that?

At the very least if this is done then you’d have to correspondingly
over-read at the end of a read if it ends in the middle of a multibyte
character – to avoid losing the data altogether. (And it’s not
a surefire way to avoid the loss, just a way to avoid surefire loss.)
But the predictable low-level semantics of byte seeks are utterly
destroyed in the process.
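The tail-end half of the problem, a byte-counted read that stops inside a character, can be seen with Encode alone. This is a sketch on an in-memory byte string and not a claim about what PerlIO does internally; it relies on the documented behaviour of Encode::FB_QUIET, which returns the decodable prefix and leaves the unprocessed bytes in the source variable.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Six bytes: E, 0xCC 0x81, E, 0xCC 0x81. A 5-byte read stops inside the
# last character.
my $bytes = encode("UTF-8", "E\x{301}E\x{301}");
my $chunk = substr($bytes, 0, 5);

# With FB_QUIET, decode() returns what it could decode and overwrites
# $chunk with the undecoded remainder.
my $text = decode("UTF-8", $chunk, Encode::FB_QUIET());
printf "decoded %d character(s), %d byte(s) left over\n",
    length($text), length($chunk);
```

A reader that wants no data loss has to carry the leftover bytes into the next read, which is the over-reading bookkeeping described above.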

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT commented Oct 17, 2011

From tchrist@perl.com

"A. Pagaltzis via RT" <perlbug-followup@perl.org> wrote
  on Mon, 17 Oct 2011 09:04:07 PDT:

That will make the read “work” in that it won’t complain and won’t
produce corrupt data, but in what practical scenario is this useful?

“Sometimes your read will swallow the first character in the data. We
won’t tell you though. In fact we’ll make sure you can’t notice even
if you wanted to.”

What is the sense of that?

Good point.

I agree that it seems dodgy. Perl is pretty careful not to go silently
destroying data, and I'd like it to remain so. I also wouldn't want to
stop {sys,}seek from working when given a legitimate tell address.

I just don't know what the devil to do with reads producing broken strings
after seek has been given a tell address that is *not* a character
boundary when you're working with an auto-decoded stream. I'm a bit
surprised that broken internal strings aren't fatals, actually. Hm?

My stomach has been somewhat unsettled of late, so I don't know that I
should necessarily trust its instincts, but my gut feel is that Perl should
never let you be able to produce a broken Perl string. Modulo intentional
malice with the low-level stuff, of course, but I don't see seek and read
as being in that bucket. This is also what makes me feel that $/ = \INT
producing broken Perl strings is also inherently flawed. I think Nick once
observed that we mustn't let those happen.

I know how the super-duper-over-object-oriented languages would "fix"
this: they'd create 52 levels of strictly typed I/O wrapper classes and
only allow you to go through their APIs to get at things. That goes
against the Unix/C mentality of allowing low-level access to the real
and very simple system calls when such is asked for, so I don't want to
even start to go there.

But what to do about these malformed characters, eh?

--tom

p5pRT commented Oct 17, 2011

From @cpansprout

On Mon Oct 17 09:18:48 2011, tom christiansen wrote:

"A. Pagaltzis via RT" <perlbug-followup@perl.org> wrote
on Mon, 17 Oct 2011 09:04:07 PDT:

That will make the read “work” in that it won’t complain and won’t
produce corrupt data, but in what practical scenario is this useful?

“Sometimes your read will swallow the first character in the data. We
won’t tell you though. In fact we’ll make sure you can’t notice even
if you wanted to.”

What is the sense of that?

Good point.

I agree that it seems dodgy. Perl is pretty careful not to go silently
destroying data, and I'd like it to remain so. I also wouldn't want to
stop {sys,}seek from working when given a legitimate tell address.

I just don't know what the devil to do with reads producing broken strings
after seek has been given a tell address that is *not* a character
boundary when you're working with an auto-decoded stream. I'm a bit
surprised that broken internal strings aren't fatals, actually. Hm?

My stomach has been somewhat unsettled of late, so I don't know that I
should necessarily trust its instincts, but my gut feel is that Perl should
never let you be able to produce a broken Perl string. Modulo intentional
malice with the low-level stuff, of course, but I don't see seek and read
as being in that bucket. This is also what makes me feel that $/ = \INT
producing broken Perl strings is also inherently flawed. I think Nick once
observed that we mustn't let those happen.

I know how the super-duper-over-object-oriented languages would "fix"
this: they'd create 52 levels of strictly typed I/O wrapper classes and
only allow you to go through their APIs to get at things. That goes
against the Unix/C mentality of allowing low-level access to the real
and very simple system calls when such is asked for, so I don't want to
even start to go there.

But what to do about these malformed characters, eh?

Either
a) croak, or
b) fudge things, as was suggested above, but with a default (formerly
known as mandatory) warning.
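Option (a) can already be approximated in user code by reading raw bytes and decoding with Encode::FB_CROAK, which dies on malformed input. This is a sketch of that behaviour and not a patch to seek or read; `decode_or_croak` is a hypothetical wrapper name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical wrapper: decode raw bytes and die on malformed UTF-8
# instead of handing back a broken or silently repaired string.
sub decode_or_croak {
    my ($bytes) = @_;
    # decode() may modify its source when CHECK is set; $bytes is our own copy.
    return decode("UTF-8", $bytes, Encode::FB_CROAK());
}

# The five bytes from the report's Dump output: a stray continuation byte first.
my $broken = "\x81E\xCC\x81E";
my $result = eval { decode_or_croak($broken) };
print defined $result ? "decoded\n" : "croaked: $@";
```

Option (b) would correspond to Encode::FB_WARN or to skipping bytes with an explicit warn, at the cost of losing the exact bytes that were read.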

p5pRT commented Oct 17, 2011

From tchrist@perl.com

"Father Chrysostomos via RT" <perlbug-followup@perl.org> wrote
  on Mon, 17 Oct 2011 09:34:44 PDT:

But what to do about these malformed characters, eh?

Either
a) croak, or
b) fudge things, as was suggested above, but with a default
(formerly known as mandatory) warning.

Speaking of mandatory warnings, I was recently bitten by there not being
one. An old perl3 script that I haven't looked at in over 20 years but use
far too often silently stopped working, and I never noticed (this proved a
bad thing). It was implicitly splitting into @_. There was no default
warning created when that was removed from Perl. Shouldn't there have
been? How come it didn't get a [D deprecated, syntax] warning?

--tom

p5pRT commented Oct 17, 2011

From @ikegami

On Mon, Oct 17, 2011 at 12:40 PM, Tom Christiansen <tchrist@perl.com> wrote:

"Father Chrysostomos via RT" <perlbug-followup@perl.org> wrote
on Mon, 17 Oct 2011 09:34:44 PDT:

But what to do about these malformed characters, eh?

Either
a) croak, or
b) fudge things, as was suggested above, but with a default
(formerly known as mandatory) warning.

Speaking of mandatory warnings, I was recently bitten by there not being
one. An old perl3 script that I haven't looked at in over 20 years but use
far too often silently stopped working, and I never noticed (this proved a
bad thing). It was implicitly splitting into @_. There was no default
warning created when that was removed from Perl. Shouldn't there have
been? How come it didn't get a [D deprecated, syntax] warning?

Because you used a version of Perl from after the depreciation cycle?

If you don't test your script with every major version, you can miss a
depreciation warning.

p5pRT commented Oct 17, 2011

From @ikegami

On Mon, Oct 17, 2011 at 1:01 PM, Eric Brine <ikegami@adaelis.com> wrote:

If you don't test your script with every major version, you can miss a
depreciation warning.

argh, I've seen deprecation misspelled so often, I started doing it!

p5pRT commented Oct 17, 2011

From tchrist@perl.com

Because you used a version of Perl from after the depreciation cycle?

If you don't test your script with every major version, you can miss a
depreciation warning.

I did? Really? I have 6/8/10/12/14 all installed. I wonder how
I missed it?

Was implicit split a "mandatory" (=default) warning at some point?

Does this mean we plan to soon stop warning people about $* not working anymore?

--tom

p5pRT commented Oct 17, 2011

From @cpansprout

On Mon Oct 17 10:08:54 2011, tom christiansen wrote:

Because you used a version of Perl from after the depreciation cycle?

If you don't test your script with every major version, you can miss a
depreciation warning.

I did? Really? I have 6/8/10/12/14 all installed. I wonder how
I missed it?

Was implicit split a "mandatory" (=default) warning at some point?

No. Deprecation warnings became default warnings in 5.12. 5.12 removed
implicit split, but failed to mention it in perldelta. I’d still like
to see it reinstated, at least in void context.

(Shameless plug: If you slap ‘use Classic::Perl;’ on all your old
scripts, you shouldn’t have so much trouble upgrading. :-)

Does this mean we plan to soon stop warning people about $* not
working anymore?

I doubt it, but you can say ** or &* or %* before mentioning $*, and the
warning will disappear. (Classic::Perl relies on that implementation
detail.)

p5pRT commented Oct 19, 2011

From @deven

On Mon, Oct 17, 2011 at 12:03 PM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:

* Deven T. Corzine <deven@ties.org> [2011-10-17 16:35]:

My proposed solution is simple: On the first character-oriented read
after a byte-oriented seek into UTF-8 data, silently ignore up to
5 continuation bytes instead of returning a malformed string. This
wouldn't have the performance implications of seeking by character
offsets, but would easily allow seeking into the middle of a large
amount of UTF-8 data without generating spurious phantom character
encoding errors from seeking to the wrong byte. A warning about the
ignored bytes could be useful, but probably rarely, so it should
probably be off unless explicitly requested.

Wouldn't this be better than creating corrupted characters in strings
that didn't actually exist in the source data?

That will make the read “work” in that it won’t complain and won’t
produce corrupt data, but in what practical scenario is this useful?

In the scenario where you want to seek into the middle of UTF-8 data, but
don't know with certainty exactly where the character boundaries are to be
able to give a valid byte offset.

“Sometimes your read will swallow the first character in the data. We
won’t tell you though. In fact we’ll make sure you can’t notice even
if you wanted to.”

What is the sense of that?

A more accurate description would be "if you seek into the middle of a
character and try to read characters, the data returned will start with the
first valid character instead of returning a corrupted character from the
inaccurate seek." If you're reading from a byte offset that isn't where a
character begins, and you're reading a character-oriented stream, not a
byte-oriented stream, what's the sense of returning a bogus malformed
character that doesn't actually exist in the source data?

At the very least if this is done then you’d have to correspondingly
over-read at the end of a read if it ends in the middle of a multibyte
character – to avoid losing the data altogether. (And it’s not
a surefire way to avoid the loss, just a way to avoid surefire loss.)
But the predictable low-level semantics of byte seeks are utterly
destroyed in the process.

I don't follow your logic here. This isn't about using byte-oriented buffer
reads for character data, this is about the semantics of mixing
byte-oriented seeks with character-oriented reads. How does over-reading
apply here?

Deven

p5pRT commented Oct 31, 2011

From perl-diddler@tlinx.org

Deven T. Corzine wrote:

I would propose adding special-case logic for UTF-8 for the following
reasons:

* UTF-8 is self-synchronizing, and continuation bytes are unambiguous.

My proposed solution is simple: On the first character-oriented read
after a byte-oriented seek into UTF-8 data, silently ignore up to 5
continuation bytes instead of returning a malformed string.

I would use a more robust approach.

Go BACK 'n' bytes (where 'n' is the maximum number of bytes needed to
resync a UTF-8 byte string).

Then go forward and return the file seek value as the position before
the first incomplete character closest to the seek boundary they asked
for. That way you will get "up to n" bytes "sought"(?) through by seek,
but it won't stop in the middle and won't lose data if you read from
that position forward.
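A user-space sketch of this back-up-and-resync idea on a :raw handle follows. It is only an illustration of the approach and not a proposed core implementation; `char_start_at_or_before` is a hypothetical helper, the cap of 5 bytes is an assumption borrowed from the earlier proposal, and the in-memory file stands in for sample.utf8.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);
use Encode qw(encode);

# Hypothetical helper: given a handle opened with :raw and a byte offset,
# step back while the byte at that offset is a UTF-8 continuation byte
# (10xxxxxx) and return the offset of the character start. $max_back caps
# how far back to look; if the cap is reached, the capped offset is returned.
sub char_start_at_or_before {
    my ($fh, $offset, $max_back) = @_;
    my $limit = $offset > $max_back ? $offset - $max_back : 0;
    while ($offset > $limit) {
        seek($fh, $offset, SEEK_SET) or die "seek failed: $!";
        my $got = read($fh, my $byte, 1);
        last unless $got;                        # at or past end of file
        last if (ord($byte) & 0xC0) != 0x80;     # found a character start
        $offset--;
    }
    return $offset;
}

my $bytes = encode("UTF-8", "E\x{301}" x 3);    # stand-in for sample.utf8
open(my $fh, "<:raw", \$bytes) or die "can't open in-memory file: $!";

for my $wanted (0 .. 4) {
    printf "asked for %d, resynced to %d\n",
        $wanted, char_start_at_or_before($fh, $wanted, 5);
}
```

Reading forward from the returned offset starts on a whole character, so nothing is dropped, although the caller gets a position slightly before the one requested.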

???

I'm not against other optional warnings and 'strict' approaches, but
for perl to "do what I mean" and "do the right thing" (core perl design
goals that are being increasingly overruled by FUD), it seems like it
would be a fairly useful and reasonable thing to do.
