funny results from ~ on non-Latin-1 string #9668

Closed
p5pRT opened this issue Mar 1, 2009 · 9 comments

p5pRT commented Mar 1, 2009

Migrated from rt.perl.org#63574 (status was 'resolved')

Searchable as RT63574$

p5pRT commented Mar 1, 2009

From zefram@fysh.org

Created by zefram@fysh.org

$ perl -lwe 'print ord(~"\x{aaa}")'
4294964565
$

~ is documented to operate on integers or bit strings, but "\x{aaa}"
is neither. Empirically, if ~ is applied to a string containing at least
one non-octet, the result is a string of the same length, where each
codepoint is equal to 0xffffffff minus an input codepoint. That could
conceivably be a useful operation, if it could be consistently applied
to all strings, but if all input codepoints are 0xff or below then ~
negates octets rather than 32-bit codepoints. ~~$_ eq $_ is broken for
values such as "\x{ffffffaa}".

I think ~ on a string that is neither a number nor an octet string should
throw an exception, because there is nothing meaningful that it can do.
If not, then it should at least warn.
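The rule described in the report can be modelled in a few lines. This is only a sketch in Python of the observed arithmetic, not Perl's implementation. The function name `complement_model` and the fixed 32-bit width are assumptions, chosen to match the 32-bit perl 5.10.0 build in the report.

```python
# Model of the reported behaviour, NOT Perl's implementation.
# Assumption: code points are treated as unsigned 32-bit values,
# as on the 32-bit perl 5.10.0 build in the report.

WIDE_MASK = 0xFFFFFFFF   # used when any code point is above 0xFF
OCTET_MASK = 0xFF        # used when every code point is an octet


def complement_model(codepoints):
    """Return the complemented code points per the rule in the report."""
    if all(cp <= 0xFF for cp in codepoints):
        # Every character fits in an octet: negate octets.
        return [OCTET_MASK - cp for cp in codepoints]
    # At least one non-octet: negate 32-bit code points.
    return [WIDE_MASK - cp for cp in codepoints]


# Reproduces the number printed by the one-liner in the report.
print(complement_model([0xAAA])[0])        # 4294964565

# Round trip fails for "\x{ffffffaa}": the first pass yields 0x55,
# which is an octet, so the second pass takes the octet branch.
once = complement_model([0xFFFFFFAA])
twice = complement_model(once)
print([hex(cp) for cp in once])            # ['0x55']
print([hex(cp) for cp in twice])           # ['0xaa']
```

The second pass returns 0xaa instead of 0xffffffaa, which is the broken `~~$_ eq $_` case from the report.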

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.10.0:

Configured by Debian Project at Thu Jan  1 12:43:38 UTC 2009.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.6.26-1-686, archname=i486-linux-gnu-thread-multi
    uname='linux rebekka 2.6.26-1-686 #1 smp mon dec 15 18:15:07 utc 2008 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.0 -Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.3.2', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/lib64
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.7.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0
    gnulibc_version='2.7'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'

Locally applied patches:
    


@INC for perl 5.10.0:
    /etc/perl
    /usr/local/lib/perl/5.10.0
    /usr/local/share/perl/5.10.0
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.10
    /usr/share/perl/5.10
    /usr/local/lib/site_perl
    .


Environment for perl 5.10.0:
    HOME=/home/zefram
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/zefram/usr/perl/util:/home/zefram/pub/i686-pc-linux-gnu/bin:/home/zefram/pub/common/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/local/bin:/usr/games
    PERL_BADLANG (unset)
    SHELL=/usr/bin/zsh

p5pRT commented Jul 17, 2009

From @ysth

Zefram wrote:

~ is documented to operate on integers or bit strings, but "\x{aaa}"
is neither. Empirically, if ~ is applied to a string containing at least
one non-octet, the result is a string of the same length, where each
codepoint is equal to 0xffffffff minus an input codepoint. That could
conceivably be a useful operation, if it could be consistently applied
to all strings, but if all input codepoints are 0xff or below then ~
negates octets rather than 32-bit codepoints. ~~$_ eq $_ is broken for
values such as "\x{ffffffaa}".

Not a bug.

The existing behavior was Jarkko's call during 5.7.x development, though it
seems to have escaped being documented.

See http://www.nntp.perl.org/group/perl.perl5.porters/2000/11/msg25864.html

p5pRT commented Jul 17, 2009

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Jul 17, 2009

From @ikegami

On Thu, Jul 16, 2009 at 11:49 PM, Yitzchak Scott-Thoennes
<sthoenna@efn.org> wrote:

Zefram wrote:

~ is documented to operate on integers or bit strings, but "\x{aaa}"
is neither. Empirically, if ~ is applied to a string containing at least
one non-octet, the result is a string of the same length, where each
codepoint is equal to 0xffffffff minus an input codepoint. That could
conceivably be a useful operation, if it could be consistently applied
to all strings, but if all input codepoints are 0xff or below then ~
negates octets rather than 32-bit codepoints. ~~$_ eq $_ is broken for
values such as "\x{ffffffaa}".

Not a bug.

The existing behavior was Jarkko's call during 5.7.x development, though it
seems to have escaped being documented.

See
http://www.nntp.perl.org/group/perl.perl5.porters/2000/11/msg25864.html

Yet another instance where the internal encoding leaks. Shouldn't this give
a "Wide character" warning/error? What's the point of allowing arithmetic on
something that doesn't follow the rules of arithmetic.

p5pRT commented Jul 18, 2009

From @ysth

On Fri, July 17, 2009 9:16 am, Eric Brine wrote:

On Thu, Jul 16, 2009 at 11:49 PM, Yitzchak Scott-Thoennes wrote:

Zefram wrote:

~ is documented to operate on integers or bit strings, but "\x{aaa}"
is neither. Empirically, if ~ is applied to a string containing at
least one non-octet, the result is a string of the same length, where
each codepoint is equal to 0xffffffff minus an input codepoint. That
could conceivably be a useful operation, if it could be consistently
applied to all strings, but if all input codepoints are 0xff or below
then ~ negates octets rather than 32-bit codepoints. ~~$_ eq $_ is
broken for values such as "\x{ffffffaa}".

Not a bug.

The existing behavior was Jarkko's call during 5.7.x development,
though it seems to have escaped being documented.

See
http://www.nntp.perl.org/group/perl.perl5.porters/2000/11/msg25864.html

Yet another instance where the internal encoding leaks. Shouldn't this
give a "Wide character" warning/error? What's the point of allowing
arithmetic on something that doesn't follow the rules of arithmetic.

No, it doesn't leak, quite intentionally. ~ behaves differently based on
whether the string contains characters gt "\x7f", not based on encoding.

p5pRT commented Jul 18, 2009

From @ysth

On Fri, July 17, 2009 7:56 pm, Yitzchak Scott-Thoennes wrote:

On Fri, July 17, 2009 9:16 am, Eric Brine wrote:

On Thu, Jul 16, 2009 at 11:49 PM, Yitzchak Scott-Thoennes wrote:

Zefram wrote:

~ is documented to operate on integers or bit strings, but "\x{aaa}"
is neither. Empirically, if ~ is applied to a string containing at
least one non-octet, the result is a string of the same length,
where each codepoint is equal to 0xffffffff minus an input
codepoint. That could conceivably be a useful operation, if it
could be consistently applied to all strings, but if all input
codepoints are 0xff or below then ~ negates octets rather than
32-bit codepoints. ~~$_ eq $_ is
broken for values such as "\x{ffffffaa}".

Not a bug.

The existing behavior was Jarkko's call during 5.7.x development,
though it seems to have escaped being documented.

See
http://www.nntp.perl.org/group/perl.perl5.porters/2000/11/msg25864.html

Yet another instance where the internal encoding leaks. Shouldn't this
give a "Wide character" warning/error? What's the point of allowing
arithmetic on something that doesn't follow the rules of arithmetic.

No, it doesn't leak, quite intentionally. ~ behaves differently based on
whether the string contains characters gt "\x7f", not based on encoding.

Gah. I mean "\xff".
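With that correction, the claim is that the branch is chosen by the code point values in the string and not by how the string is stored. A minimal sketch in Python of that predicate, where `takes_wide_branch` is a hypothetical name and not anything in Perl's source:

```python
# Hypothetical helper modelling the rule as corrected above.
# Python strings are sequences of code points, so the check cannot
# depend on any internal byte encoding.

def takes_wide_branch(s):
    """True if any character has a code point above 0xFF."""
    return any(ord(ch) > 0xFF for ch in s)


# Boundary cases: 0x7F, 0x80 and 0xFF stay on the octet branch;
# 0x100 and above switch the whole string to the wide branch.
for text in ["\x7f", "\x80", "\xff", "\u0100", "a\u0aaa"]:
    print([hex(ord(ch)) for ch in text], takes_wide_branch(text))
```

Only the last two strings report `True`, because each contains at least one character above 0xFF.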

p5pRT commented Jul 18, 2016

From @dcollinsn

I note that this is still open, even though it was declared "not a bug". I'm not nearly smart enough to figure out what should be happening here, can someone help?

--
Respectfully,
Dan Collins

p5pRT commented Jul 18, 2016

From @khwilliamson

On Mon Jul 18 13:15:54 2016, dcollinsn@gmail.com wrote:

I note that this is still open, even though it was declared "not a
bug". I'm not nearly smart enough to figure out what should be
happening here, can someone help?

If you run it now, you see that the OP request has been fulfilled. It does not yet throw an exception, but it now does warn, so I'm resolving this ticket.

$ blead -lwe 'print ord(~"\x{aaa}")'
Use of strings with code points over 0xFF as arguments to 1's complement (~) operator is deprecated at -e line 1.
Use of code point 0xFFFFFFFFFFFFF555 is deprecated; the permissible max is 0x7FFFFFFFFFFFFFFF at -e line 1.
Use of code point 0xFFFFFFFFFFFFF555 is deprecated; the permissible max is 0x7FFFFFFFFFFFFFFF in ord at -e line 1.
18446744073709548885

(on a 64-bit system; the original was 32-bit)
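The figures in this output follow from the same subtraction as in the original report, only with a 64-bit all-ones value. A quick arithmetic check in Python, assuming unsigned 64-bit code point values. The constant names are illustrative; the values themselves are taken from the messages above.

```python
# Arithmetic check only; assumes unsigned 64-bit code point values,
# matching the "64-bit system" note above.
ALL_ONES_64 = 0xFFFFFFFFFFFFFFFF      # 64-bit all-ones value
ALL_ONES_32 = 0xFFFFFFFF              # 32-bit all-ones value
PERMISSIBLE_MAX = 0x7FFFFFFFFFFFFFFF  # limit quoted in the warning

result_64 = ALL_ONES_64 - 0xAAA
result_32 = ALL_ONES_32 - 0xAAA

print(f"0x{result_64:X}")           # 0xFFFFFFFFFFFFF555
print(result_64)                    # 18446744073709548885
print(result_64 > PERMISSIBLE_MAX)  # True
print(result_32)                    # 4294964565
```

The 64-bit result is above the permissible maximum quoted in the warnings, while the 32-bit result matches the number in the original report.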

It may not have been a bug, but eventually we decided the design decision was wrong, and we are now in a deprecation cycle to forbid this, as the OP requested.

--
Karl Williamson

p5pRT commented Jul 18, 2016

@khwilliamson - Status changed from 'open' to 'resolved'
