ISO-2022-JP encoder erroneously passes ISO-8859-1 characters #6591

Open
p5pRT opened this issue Jun 27, 2003 · 3 comments

p5pRT commented Jun 27, 2003

Migrated from rt.perl.org#22833 (status was 'open')

Searchable as RT22833$

p5pRT commented Jun 27, 2003

From roy.badami@globalgraphics.com

Created by roy.badami@globalgraphics.com

The ISO-2022-JP encoder erroneously passes characters in the top half of
ISO-8859-1, rather than detecting them as invalid.

The following should fail (a pound sterling sign can't be represented
in ISO-2022-JP). Instead, \xA3 is just passed to the output.

use Encode;
# Double quotes, so that \xA3 is interpolated as the character U+00A3.
encode('ISO-2022-JP', "\xA3", Encode::FB_CROAK);
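
For anyone trying to reproduce this, the following is a minimal sketch rather than the reporter's exact script. It wraps the call in `eval` so that the result can be inspected either way, and it shows why the quoting matters: in Perl, a single-quoted `'\xA3'` is the four ASCII characters backslash, `x`, `A`, `3`, while a double-quoted `"\xA3"` is the single character U+00A3. The variable names are illustrative, and what the script prints depends on the installed Encode version, so no output is shown here.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK);

# Double quotes interpolate the escape: one character, U+00A3 (POUND SIGN).
my $input = "\xA3";

# Single quotes do not: this is a 4-character ASCII string.
my $literal = '\xA3';
printf "double-quoted length: %d, single-quoted length: %d\n",
    length $input, length $literal;

# encode() may modify its argument when a CHECK value is given,
# so hand it a copy and keep $input intact.
my $copy  = $input;
my $bytes = eval { encode('ISO-2022-JP', $copy, FB_CROAK) };

if (defined $bytes) {
    # No exception: show exactly which bytes were produced.
    printf "no error; output bytes: %s\n",
        join ' ', map { sprintf '%02x', ord } split //, $bytes;
}
else {
    print "encode croaked: $@";
}
```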

Perl Info

Flags:
    category=library
    severity=medium

Site configuration information for perl v5.8.0:

Configured by khalid at Fri Aug 16 14:54:49 BST 2002.

Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
  Platform:
    osname=solaris, osvers=2.5.1, archname=sun4-solaris
    uname='sunos chihuahua 5.5.1 generic_103640-29 sun4u sparc sunw,ultra-1 '
    config_args='-d -Dcc=gcc -Uinstallusrbinperl -Dlibpth=/usr/lib /usr/ccs/lib -Dpager=/usr/ucb/more -Ui_gdbm -Ui_db -Dstartperl=#!/usr/local/bin/perl5 -Dprefix=/usr/local/soft/perl-5.8.0/run/default/sparc_sun_solaris2.5.1 -Dsiteprefix=/usr/local/'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-fno-strict-aliasing -I/usr/local/include ',
    optimize='-O',
    cppflags='-fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='2.95.2 19991024 (release)', gccosandvers='solaris2.5.1'
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=4321
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=4
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' '
    libpth=/usr/lib /usr/ccs/lib
    libs=-lsocket -lnsl -ldl -lm -lc
    perllibs=-lsocket -lnsl -ldl -lm -lc
    libc=/lib/libc.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
    cccdlflags='-fPIC', lddlflags='-G'

Locally applied patches:
    


@INC for perl v5.8.0:
    /usr/local/soft/perl-5.8.0/run/default/sparc_sun_solaris2.5.1/lib/5.8.0/sun4-solaris
    /usr/local/soft/perl-5.8.0/run/default/sparc_sun_solaris2.5.1/lib/5.8.0
    /usr/local//lib/site_perl/5.8.0/sun4-solaris
    /usr/local//lib/site_perl/5.8.0
    /usr/local//lib/site_perl
    .


Environment for perl v5.8.0:
    HOME=/u/roy
    LANG (unset)
    LANGUAGE (unset)
    LC_COLLATE=en_UK
    LC_CTYPE=en_UK
    LC_MESSAGES=C
    LC_MONETARY=en_UK
    LC_NUMERIC=en_UK
    LC_TIME=en_UK
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=:.:/u/roy/bin:/usr/local/bin:/usr/ucb:/usr/bin/bsd:/bin:/usr/bin:/usr/local/X11R5/bin:/usr/bin/X11:/usr/new:/etc:/usr/etc:/usr/5bin:/usr/local/lib/frame/bin
    PERL_BADLANG (unset)
    SHELL=/usr/local/bin/bash

p5pRT commented Jun 27, 2003

From dankogai@dan.co.jp

On Friday, June 27, 2003, at 07:18 PM, Roy Badami (via RT) wrote:

> ISO-2022-JP encoder erroneously passes characters in the top half of
> ISO-8859-1, rather than detecting them as invalid.

It should have been documented that, for algorithmic encodings like
ISO-2022-JP, fallbacks are not as thoroughly enforced as they are for
Encode::XS-based encodings. One of the reasons is that it is really
hard to implement meaningful fallbacks in such cases.

It is quite obvious that \xA3 or anything above \x80 is invalid but see
this case.

"\x{99f1}\x{99dd} means a camel" in ISO-2022-JP is as follows:

\x1b\x24\x42\x71\x51\x71\x4c\x1b\x28\x42 means a camel
<---------->                <---------->
 JIS starts                 Back to ASCII

Suppose \x51 is dropped and the string above gets garbled. You might
simply say it should return
"\x71\x4c\x1b\x28\x42 means a camel", but that does not work as a
fallback: the result is a VALID ASCII string, so you cannot tell from
it which state the string was in.
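
The point about state can be checked with a short sketch. It encodes the camel string, prints the bytes with non-printable ones escaped so the two escape sequences are visible, and then confirms that the fragment left over after a dropped byte contains nothing above \x7f. The helper name `show` is made up for this example; the byte values of the fragment are the ones quoted above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical display helper: escape every byte outside printable ASCII.
sub show {
    my ($bytes) = @_;
    $bytes =~ s/([^\x20-\x7e])/sprintf '\\x%02x', ord $1/ge;
    return $bytes;
}

my $kanji = "\x{99f1}\x{99dd}";
my $bytes = encode('ISO-2022-JP', $kanji . ' means a camel');
print show($bytes), "\n";

# The fragment from the message above, after the leading bytes are lost.
my $garbled = "\x71\x4c\x1b\x28\x42 means a camel";

# Every byte is in the 7-bit range, so nothing marks it as damaged JIS data.
my $is_7bit = $garbled !~ /[^\x00-\x7f]/;
print 'garbled fragment is pure 7-bit: ', ($is_7bit ? 'yes' : 'no'), "\n";
```

Because ISO-2022-JP is a 7-bit encoding throughout, looking at the high bit tells you nothing; only the escape sequences carry the state.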

In other words, fallbacks are not implemented for algorithmic encodings
-- yet. However, at least FB_CROAK should be implemented properly.

Dan the Encode Maintainer

p5pRT commented Jun 27, 2003

From roy.badami@globalgraphics.com

> It is quite obvious that \xA3 or anything above \x80 is invalid but see
> this case.
>
> "\x{99f1}\x{99dd} means a camel" in ISO-2022-JP is as follows;
>
> \x1b\x24\x42\x71\x51\x71\x4c\x1b\x28\x42 means a camel
> <---------->                <---------->
>  JIS starts                 Back to Ascii
>
> Suppose \x51 is dropped and the string above gets garbled. You may
> simply say it should return
> "\x71\x4c\x1b\x28\x42 means a camel" but it does not function as
> fallback because from this you can't tell in what state the string was
> in because it is a VALID ASCII string.

I'm not sure I follow your example. But in any case, I'm talking
about encode, not decode, so the input would be a sequence of Unicode
characters.

> In other words, fallbacks are not implemented for algorithmic encodings
> -- yet. However, at least FB_CROAK should be implemented properly.

That's all I'm really after...

Thanks

  -roy
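
Until strict checking can be relied on for a given encoding, a caller-side check is one possible stopgap. The sketch below is an assumption on my part, not something proposed in this thread: `encode_or_die` is a hypothetical helper that encodes, decodes the result again, and dies if the text did not survive unchanged. It will also reject input that an encoding deliberately normalizes on the way out, so treat it as a rough guard rather than a replacement for FB_CROAK.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper (not part of Encode): round-trip the string and
# die if decoding the encoded bytes does not give the original back.
sub encode_or_die {
    my ($encoding, $string) = @_;
    my $copy  = $string;                      # keep the original untouched
    my $bytes = encode($encoding, $copy);
    my $back  = decode($encoding, $bytes);
    die "string is not representable in $encoding\n"
        unless $back eq $string;
    return $bytes;
}

# Plain ASCII survives the round trip.
my $ok = encode_or_die('ISO-2022-JP', 'a camel');
print "encoded: $ok\n";

# A kanji cannot be represented in US-ASCII, so the helper dies.
my $failed = !eval { encode_or_die('ascii', "\x{99f1}"); 1 };
print 'kanji rejected for ascii: ', ($failed ? 'yes' : 'no'), "\n";
```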
