Text::CSV::Encoded is incorrectly forced to parse widechar #15739

Closed
p5pRT opened this issue Nov 28, 2016 · 10 comments

Comments

p5pRT commented Nov 28, 2016

Migrated from rt.perl.org#130199 (status was 'rejected')

Searchable as RT130199$

p5pRT commented Nov 28, 2016

From rafal@zorro.ztk-rp.eu

Created by rafal@zorro.ztk-rp.eu

After upgrading from debian-wheezy to debian-jessie, HTML::Mason started
to behave strangely with respect to UTF-8 encoding. Earlier, both web pages
and forms worked correctly (in UTF-8) without any special setup. As
of jessie with Apache 2.4, UTF-8 no longer works:
1. I had to add binmode(STDOUT,'UTF8') to modules.
2. I had to decode_utf8($_) data from forms before passing it on
to the psql db.
I am filing this report with example code showing the erratic behavior of Text::CSV::Encoded,
since I could narrow the problem down to a test case of just a few lines.

========================
#!/usr/bin/perl
use Text::CSV::Encoded;
open(my $FH, shift) or die "open";
binmode($FH, ":encoding(cp1250) :raw :bytes");
local $/ = "\r\n";
my $csv = Text::CSV::Encoded->new ( { encoding_in => "cp1250",
  binary => 1, eol => $/, sep_char => ';',
  } ) or die "Cannot use CSV: ".Text::CSV->error_diag ();
$\ = "\n";
while ( <$FH> ) {
  s/\s+$//;
  print;
  if ($csv->parse( $_ )) {
    print $csv->fields();
  }
}
__END__
10;"SPӣDZIELNIA
WARSZAWA";62;"TEST"

In this example:
1. the test file (provided "inline" as <DATA>) contains two specific
characters from code page 1250, one immediately after the other.
1a. this test file is NOT UTF-8 encoded.
2. the input stream is correctly marked as CP1250
3. the module gets correct information about the file encoding
... and yet, the module complains about encountering a "wide char", which in
the above setup should never be possible (I think).

The result of the above program is:

$ ./wide-char test-input
10;"SPӣDZIELNIA
WARSZAWA";62;"TEST"
Wide character in subroutine entry at /usr/share/perl5/Text/CSV/Encoded/Coder/Encode.pm line 37, <$FH> chunk 1.
$

This result is incorrect, since the file does not contain any "wide chars".
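
For context, one common way to get a "Wide character" failure from Encode is to decode a string that has already been decoded. The following is a minimal sketch of that situation using only the core Encode module; the two bytes 0xD3 0xA3 are an assumption about the sample data, and the sketch does not establish that this is what happens inside Text::CSV::Encoded in this report.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed sample: bytes 0xD3 0xA3, which are two letters in cp1250.
my $octets = "\xD3\xA3";

# First decode: octets become the characters U+00D3 and U+0141.
my $chars = decode('cp1250', $octets);

# Second decode of the already-decoded string: U+0141 does not fit in a
# single byte, so Encode is expected to die with a "Wide character" message.
my $ok  = eval { decode('cp1250', $chars); 1 };
my $err = $ok ? '' : $@;

printf "code points: %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $chars;
print "second decode failed: $err" unless $ok;
```

The exact wording of the error depends on the installed Encode version, so the sketch only relies on the call failing.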

Perl Info

Flags:
    category=core
    severity=high

Site configuration information for perl 5.20.2:

Configured by Debian Project at Fri Jul 22 15:47:27 UTC 2016.

Summary of my perl5 (revision 5 version 20 subversion 2) configuration:
   
  Platform:
    osname=linux, osvers=3.16.0-4-amd64, archname=x86_64-linux-gnu-thread-multi
    uname='linux himalia 3.16.0-4-amd64 #1 smp debian 3.16.7-ckt25-2+deb8u3 (2016-07-02) x86_64 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Dldflags= -Wl,-z,relro -Dlddlflags=-shared -Wl,-z,relro -Dcccdlflags=-fPIC -Darchname=x86_64-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.20 -Darchlib=/usr/lib/x86_64-linux-gnu/perl/5.20 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/x86_64-linux-gnu/perl5/5.20 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.20.2 -Dsitearch=/usr/local/lib/x86_64-linux-gnu/perl/5.20.2 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dusesitecustomize -Duse64bitint -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -Ui_libutil -Uversiononly -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.20.2 -des'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2 -g',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fwrapv -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.9.2', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/4.9/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=libc-2.19.so, so=so, useshrplib=true, libperl=libperl.so.5.20
    gnulibc_version='2.19'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib -fstack-protector'

Locally applied patches:
    DEBPKG:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN.
    DEBPKG:debian/db_file_ver - http://bugs.debian.org/340047 Remove overly restrictive DB_File version check.
    DEBPKG:debian/doc_info - Replace generic man(1) instructions with Debian-specific information.
    DEBPKG:debian/enc2xs_inc - http://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @INC directories.
    DEBPKG:debian/errno_ver - http://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes.
    DEBPKG:debian/libperl_embed_doc - http://bugs.debian.org/186778 Note that libperl-dev package is required for embedded linking
    DEBPKG:fixes/respect_umask - Respect umask during installation
    DEBPKG:debian/writable_site_dirs - Set umask approproately for site install directories
    DEBPKG:debian/extutils_set_libperl_path - EU:MM: set location of libperl.a under /usr/lib
    DEBPKG:debian/no_packlist_perllocal - Don't install .packlist or perllocal.pod for perl or vendor
    DEBPKG:debian/prefix_changes - Fiddle with *PREFIX and variables written to the makefile
    DEBPKG:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets.
    DEBPKG:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor.
    DEBPKG:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy.
    DEBPKG:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable.
    DEBPKG:debian/mod_paths - Tweak @INC ordering for Debian
    DEBPKG:debian/module_build_man_extensions - http://bugs.debian.org/479460 Adjust Module::Build manual page extensions for the Debian Perl policy
    DEBPKG:debian/prune_libs - http://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need.
    DEBPKG:fixes/net_smtp_docs - [rt.cpan.org #36038] http://bugs.debian.org/100195 Document the Net::SMTP 'Port' option
    DEBPKG:debian/perlivp - http://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local
    DEBPKG:debian/deprecate-with-apt - http://bugs.debian.org/747628 Point users to Debian packages of deprecated core modules
    DEBPKG:debian/squelch-locale-warnings - http://bugs.debian.org/508764 Squelch locale warnings in Debian package maintainer scripts
    DEBPKG:debian/skip-upstream-git-tests - Skip tests specific to the upstream Git repository
    DEBPKG:debian/patchlevel - http://bugs.debian.org/567489 List packaged patches for 5.20.2-3+deb8u6 in patchlevel.h
    DEBPKG:debian/skip-kfreebsd-crash - http://bugs.debian.org/628493 [perl #96272] Skip a crashing test case in t/op/threads.t on GNU/kFreeBSD
    DEBPKG:fixes/document_makemaker_ccflags - http://bugs.debian.org/628522 [rt.cpan.org #68613] Document that CCFLAGS should include $Config{ccflags}
    DEBPKG:debian/find_html2text - http://bugs.debian.org/640479 Configure CPAN::Distribution with correct name of html2text
    DEBPKG:debian/perl5db-x-terminal-emulator.patch - http://bugs.debian.org/668490 Invoke x-terminal-emulator rather than xterm in perl5db.pl
    DEBPKG:debian/cpan-missing-site-dirs - http://bugs.debian.org/688842 Fix CPAN::FirstTime defaults with nonexisting site dirs if a parent is writable
    DEBPKG:fixes/memoize_storable_nstore - [rt.cpan.org #77790] http://bugs.debian.org/587650 Memoize::Storable: respect 'nstore' option not respected
    DEBPKG:debian/regen-skip - Skip a regeneration check in unrelated git repositories
    DEBPKG:fixes/regcomp-mips-optim - [perl #122817] http://bugs.debian.org/754054 Downgrade the optimization of regcomp.c on mips and mipsel due to a gcc-4.9 bug
    DEBPKG:debian/makemaker-pasthru - http://bugs.debian.org/758471 Pass LD settings through to subdirectories
    DEBPKG:fixes/perldoc-less-R - [rt.cpan.org #98636] http://bugs.debian.org/758689 Tell the 'less' pager to allow terminal escape sequences
    DEBPKG:fixes/pod_man_reproducible_date - http://bugs.debian.org/759405 Support POD_MAN_DATE in Pod::Man for the left-hand footer
    DEBPKG:fixes/io_uncompress_gunzip_inmemory - http://bugs.debian.org/747363 [rt.cpan.org #95494] Fix gunzip to in-memory file handle
    DEBPKG:fixes/socket_test_recv_fix - http://bugs.debian.org/758718 [perl #122657] Compare recv return value to peername in socket test
    DEBPKG:fixes/hurd_socket_recv_todo - http://bugs.debian.org/758718 [perl #122657] TODO checking the result of recv() on hurd
    DEBPKG:fixes/regexp-performance - [0fa70a0] http://bugs.debian.org/777556 [perl #123743] simpify and speed up /.*.../ handling
    DEBPKG:fixes/failed_require_diagnostics - http://bugs.debian.org/781120 [perl #123270] Report inaccesible file on failed require
    DEBPKG:fixes/array-cloning - http://bugs.debian.org/779357 [perl #124127] [902d169] fix cloning arrays with unused elements
    DEBPKG:fixes/perldb-threads - http://bugs.debian.org/779357 [perl #124127] [41ef2c6] lib/perl5db.pl: Restore noop lock prototype
    DEBPKG:fixes/CVE-2015-8607_file_spec_taint_fix - ensure File::Spec::canonpath() preserves taint
    DEBPKG:fixes/encode-unicode-bom - http://bugs.debian.org/798727 [rt.cpan.org #107043] Address https://rt.cpan.org/Public/Bug/Display.html?id=107043
    DEBPKG:debian/encode-unicode-bom-doc - http://bugs.debian.org/798727 Document Debian backport of Encode::Unicode fix
    DEBPKG:debian/kfreebsd-softupdates - http://bugs.debian.org/796798 Work around Debian Bug#796798
    DEBPKG:fixes/CVE-2016-2381_duplicate_env - remove duplicate environment variables from environ
    DEBPKG:debian/debugperl-compat-fix - [perl #127212] http://bugs.debian.org/810326 Disable PERL_TRACK_MEMPOOL for debugging builds
    DEBPKG:fixes/CVE-2015-8853_regexp_hang - http://bugs.debian.org/821848 [perl #123562] PATCH [perl #123562] Regexp-matching "hangs"
    DEBPKG:fixes/utf8_regexp_crash - http://bugs.debian.org/820328 [perl #124109] save_re_context(): do "local $n" with no PL_curpm
    DEBPKG:fixes/regcomp_whitespace_fix - http://bugs.debian.org/820328 [perl #124109] Perl_save_re_context(): re-indent after last commit
    DEBPKG:fixes/5.20.3/eval_label_crash - http://bugs.debian.org/822336 [perl #123652] eval {label:} crash
    DEBPKG:fixes/5.20.3/preserve_record_separator - http://bugs.debian.org/822336 [perl #123218] "preserve" $/ if set to a bad value
    DEBPKG:fixes/5.20.3/test_count_base_rs - http://bugs.debian.org/822336 Fix test count in t/base/rs.t
    DEBPKG:fixes/5.20.3/remove_get_magic - http://bugs.debian.org/822336 [perl #123739] Remove get-magic from $/
    DEBPKG:fixes/5.20.3/speed_up_scalar_g - http://bugs.debian.org/822336 [perl #123202] speed up scalar //g against tainted strings
    DEBPKG:fixes/5.20.3/accidental_all_features - http://bugs.debian.org/822336 Stop $^H |= 0x1c020000 from enabling all features
    DEBPKG:fixes/5.20.3/multidimensional_arrays_utf8 - http://bugs.debian.org/822336 [perl #124113] Make check for multi-dimensional arrays be UTF8-aware
    DEBPKG:fixes/5.20.3/unquoted_utf8_heredoc_terminators - http://bugs.debian.org/822336 Allow unquoted UTF-8 HERE-document terminators
    DEBPKG:fixes/5.20.3/parentheses_ambiguous_warning_utf8_functions - http://bugs.debian.org/822336 Fix "...without parentheses is ambuguous" warning for UTF-8 function names
    DEBPKG:fixes/5.20.3/leak_namepv_copy - http://bugs.debian.org/822336 [perl #123786] don't leak the temp utf8 copy of namepv
    DEBPKG:fixes/5.20.3/h2ph_hex_constants - http://bugs.debian.org/822336 h2ph: correct handling of hex constants for the preamble
    DEBPKG:fixes/5.20.3/leftbracket_XTERMORDORDOR - http://bugs.debian.org/822336 [perl #123711] Fix crash with 0-5x-l{0}
    DEBPKG:fixes/5.20.3/fatalize_warnings_unwinding - http://bugs.debian.org/822336 [perl #123398] don't fatalize warnings during unwinding (#123398)
    DEBPKG:fixes/5.20.3/setpgrp - http://bugs.debian.org/822336 =?UTF-8?q?Don=E2=80=99t=20treat=20setpgrp($nonzero)=20as=20setpgr?= =?UTF-8?q?p(1)?=
    DEBPKG:fixes/5.20.3/death_unwinding_crash - http://bugs.debian.org/822336 [perl #124156] RT #124156: death during unwinding causes crash
    DEBPKG:fixes/5.20.3/stashpvn_crash - http://bugs.debian.org/822336 [perl #125541] Fix crash with %::=(); J->${\"::"}
    DEBPKG:fixes/5.20.3/possessive_quantifier - http://bugs.debian.org/822336 [perl #125825] PATCH: [perl 125825] {n}+ possessive quantifier broken
    DEBPKG:fixes/5.20.3/quoted_code_crash - http://bugs.debian.org/822336 [perl #123712] Fix /$a[/ parsing
    DEBPKG:fixes/5.20.3/checking_sub_inwhat - http://bugs.debian.org/822336 [perl #123712] Don't check sub_inwhat
    DEBPKG:fixes/5.20.3/yylex_loop - http://bugs.debian.org/822336 Fix hang with "@{"
    DEBPKG:fixes/5.20.3/docs/op - http://bugs.debian.org/822336 Fix apidocs for OP_TYPE_IS(_OR_WAS) - arguments separated by |, not ,.
    DEBPKG:fixes/5.20.3/docs/encoding - http://bugs.debian.org/822336 perlpodspec: Corrections/adds to detecting =encoding
    DEBPKG:fixes/5.20.3/docs/SvPV_set - http://bugs.debian.org/822336 improve SvPV_set's docs, it really shouldn't be public API
    DEBPKG:fixes/5.20.3/docs/autodie - http://bugs.debian.org/822336 Fix warning message regarding "use autodie" and "use open".
    DEBPKG:fixes/5.20.3/docs/autodie_2_26 - http://bugs.debian.org/822336 perlunicook: Note that autodie >= 2.26 should be okay with "use open".
    DEBPKG:fixes/5.20.3/docs/setenv - http://bugs.debian.org/822336 Fix setenv() replacement documentation in perlclib
    DEBPKG:fixes/5.20.3/docs/clib_caution - http://bugs.debian.org/822336 perlhacktips: Add caution about clib ptr returns to static memory
    DEBPKG:fixes/5.20.3/docs/perlunicook_typos - http://bugs.debian.org/822336 Fix minor code typos in perlunicook
    DEBPKG:fixes/5.20.3/docs/ook_example - http://bugs.debian.org/822336 [perl #122322] Update OOK example in perlguts
    DEBPKG:fixes/5.20.3/docs/study_noop - http://bugs.debian.org/822336 perlfunc: mention that study() is currently a noop
    DEBPKG:fixes/CVE-2016-1238/remove-dot-when-loading - [perl #127834] (perl #127834) remove . from the end of @INC if complex modules are loaded
    DEBPKG:fixes/CVE-2016-1238/remove-dot-in-padwalker - [perl #127834] perl5db.pl: ensure PadWalker is loaded from standard paths
    DEBPKG:fixes/CVE-2016-1238/remove-dot-in-dist - [perl #127834] dist/: remove . from @INC when loading optional modules
    DEBPKG:fixes/CVE-2016-1238/remove-dot-in-cpan - [perl #127834] cpan/: remove . from @INC when loading optional modules
    DEBPKG:fixes/CVE-2016-1238/customized-encode - Update customized.dat for cpan/Encode/Encode.pm
    DEBPKG:debian/CVE-2016-1238/test-suite-without-dot - [perl #127810] Patch unit tests to explicitly insert "." into @INC when needed.
    DEBPKG:debian/CVE-2016-1238/eumm-without-dot - [perl #127810] Add PERL_USE_UNSAFE_INC support to EU::MM for fortify_inc support.
    DEBPKG:debian/CVE-2016-1238/cpan-without-dot - [perl #127810] Set PERL_USE_UNSAFE_INC for cpan usage
    DEBPKG:debian/CVE-2016-1238/mb-without-dot - Make Module::Build set PERL_USE_UNSAFE_INC
    DEBPKG:debian/CVE-2016-1238/sitecustomize-in-etc - Look for sitecustomize.pl in /etc/perl rather than sitelib on Debian systems
    DEBPKG:fixes/xsloader-eval - [rt.cpan.org #115808] http://bugs.debian.org/829578 =?UTF-8?q?Don=E2=80=99t=20let=20XSLoader=20load=20relative=20path?= =?UTF-8?q?s?=


@INC for perl 5.20.2:
    /etc/perl
    /usr/local/lib/x86_64-linux-gnu/perl/5.20.2
    /usr/local/share/perl/5.20.2
    /usr/lib/x86_64-linux-gnu/perl5/5.20
    /usr/share/perl5
    /usr/lib/x86_64-linux-gnu/perl/5.20
    /usr/share/perl/5.20
    /usr/local/lib/site_perl


Environment for perl 5.20.2:
    HOME=/home/rafal
    LANG=pl_PL.utf8
    LANGUAGE=en_US:en
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/home/rafal/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

p5pRT commented Nov 28, 2016

From @jkeenan

On Mon, 28 Nov 2016 12:34:02 GMT, rafal@zorro.ztk-rp.eu wrote:

[snip]

It appears that the file does indeed contain characters which satisfy the condition required for the "Wide character" warning. Here's what pod/perldiag.pod in perl-5.24.0 says:

#####
=item Wide character in %s

(S utf8) Perl met a wide character (>255) when it wasn't expecting
one. This warning is by default on for I/O (like print). The easiest
way to quiet this warning is simply to add the C<:utf8> layer to the
output, e.g. C<binmode STDOUT, ':utf8'>. Another way to turn off the
warning is to add C<no warnings 'utf8';> but that is often closer to
cheating. In general, you are supposed to explicitly mark the
filehandle with an encoding, see L<open> and L<perlfunc/binmode>.
#####

If I put your test data into a file and run it through 'od -c', I observe two characters in the >255 range.

#####
$ od -c warsaw.txt
0000000   1   0   ;   "   S   P 323 243   D   Z   I   E   L   N   I   A
0000020  \n   W   A   R   S   Z   A   W   A   "   ;   6   2   ;   "   T
0000040   E   S   T   "  \n
0000045
#####

Text::CSV::Encoded is not part of the Perl 5 core distribution, so I think including it in the test script muddies the waters. Here's a pure Perl reduction:

#####
$ cat 2-130199-text-csv-encoded.pl
# perl
use strict;
use warnings;

open(my $FH, '<', 'warsaw.txt') or die "open";
binmode($FH, ":encoding(cp1250)");
while ( <$FH> ) {
  s/\s+$//;
  print "$_\n";
}
close $FH or die "close";
#####
$ perl 2-130199-text-csv-encoded.pl
Wide character in print at 2-130199-text-csv-encoded.pl line 9, <$FH> line 1.
10;"SPÓŁDZIELNIA
WARSZAWA";62;"TEST"
#####

I think that warning is appropriate. However, I concede that I don't have much experience with 'cp1250' so I'm unclear what the expected behavior is. Other people on list should comment.

Thank you very much.

p5pRT commented Nov 28, 2016

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Nov 28, 2016

From @jkeenan

On Mon, 28 Nov 2016 23:03:51 GMT, jkeenan wrote:

[snip]

On #p5p khw has pointed out an error in my analysis. 'od -c' prints non-printing bytes in octal, so 323 and 243 are decimal 211 and 163, both below octal 377 (decimal 255).
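
To make the octal point concrete, here is a minimal sketch that converts the two values shown by od and then decodes the corresponding bytes as cp1250. It uses only the core Encode module; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# od -c shows the two non-ASCII bytes as octal 323 and 243.
my @bytes = map { oct } qw(0323 0243);
printf "octal %o = decimal %d = hex 0x%02X\n", $_, $_, $_ for @bytes;

# Decoding those bytes as cp1250 yields characters, not bytes.
my $chars = decode('cp1250', pack('C*', @bytes));
my @ords  = map { ord } split //, $chars;
printf "U+%04X %s\n", $_, ($_ > 255 ? '(wide)' : '(fits in a byte)')
    for @ords;
```

Both raw bytes are below 256, but after decoding the second character is U+0141, which is above 255. The file on disk therefore contains no wide characters, while the decoded string does.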

Also, in my test program I should have applied binmode to STDOUT as well.

#####
# perl
use strict;
use warnings;

open(my $FH, '<', 'warsaw.txt') or die "open";
binmode($FH, ":encoding(cp1250)");
binmode(STDOUT, ":encoding(cp1250)");
while ( <$FH> ) {
  s/\s+$//;
  print "$_\n";
}
close $FH or die "close";
#####
$ perl 2-130199-text-csv-encoded.pl
10;"SPӣDZIELNIA
WARSZAWA";62;"TEST"
#####

And once I 'binmode' STDOUT, the "Wide character" warning goes away. So, notwithstanding my errors, I still think this is not a bug -- at least not in perl-5.24.0.
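
The effect of the output layer can be reproduced without a terminal. This is a minimal sketch that prints the same decoded string to two in-memory filehandles, one without and one with an encoding layer; the handle and buffer names are illustrative, and the string is an assumed stand-in for the first field of the sample data.

```perl
use strict;
use warnings;

# Decoded characters, as they would arrive through :encoding(cp1250).
my $text = "SP\x{D3}\x{141}DZIELNIA";

# Collect warnings instead of printing them.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# No encoding layer on the output handle: perl warns about U+0141.
open(my $plain, '>', \my $plain_buf) or die "open: $!";
print {$plain} $text;
close $plain;

# With an explicit layer the characters are encoded back to cp1250 bytes.
open(my $enc, '>:encoding(cp1250)', \my $enc_buf) or die "open: $!";
print {$enc} $text;
close $enc;

print "warnings collected: ", scalar(@warnings), "\n";
printf "bytes with layer: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $enc_buf;
```

With the layer in place the two letters come out as the bytes D3 A3 again, which matches the od output of the input file.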

Thank you very much.

--
James E Keenan (jkeenan@cpan.org)

p5pRT commented Nov 29, 2016

From @eserte

On Mon, 28 Nov 2016 04:34:02 -0800, rafal@zorro.ztk-rp.eu wrote:

[snip]

As it seems to make a difference if the CSV file has DOS or UNIX newlines --- can you attach the sample file? (In any case, either with DOS or UNIX newlines I don't see different behavior between Debian's perl in wheezy and jessie)
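
Why the newline style matters for the test script can be sketched as follows. The script sets $/ to "\r\n", so how many records it reads depends on whether that sequence occurs in the file. The helper name count_records and the two sample strings are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Count the records read from an in-memory copy of the data when the
# input record separator is "\r\n", as in the reported test script.
sub count_records {
    my ($data) = @_;
    local $/ = "\r\n";
    open(my $fh, '<', \$data) or die "open: $!";
    binmode $fh;                      # avoid CRLF translation on any platform
    my $n = 0;
    while (defined(my $record = <$fh>)) {
        $n++;
    }
    close $fh;
    return $n;
}

my $unix = qq{1;"a"\n2;"b"\n};        # LF line endings
my $dos  = qq{1;"a"\r\n2;"b"\r\n};    # CRLF line endings

printf "LF file:   %d record(s)\n", count_records($unix);
printf "CRLF file: %d record(s)\n", count_records($dos);
```

With LF-only data the separator never matches, so the whole file arrives as a single record, which is consistent with the "chunk 1" in the reported error message.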

p5pRT commented Nov 29, 2016

From cm.perl@abtela.com

On 28/11/2016 at 13:34, (via RT) wrote:

LANG=pl_PL.utf8
LANGUAGE=en_US:en

Maybe a wild shot, but isn't that combination asking for trouble? FWIW,
see http://stackoverflow.com/a/2510548

p5pRT commented Dec 2, 2016

From @jkeenan

On Tue, 29 Nov 2016 08:24:13 GMT, slaven@rezic.de wrote:

On Mon, 28 Nov 2016 04:34:02 -0800, rafal@zorro.ztk-rp.eu wrote:

[snip]

As it seems to make a difference if the CSV file has DOS or UNIX
newlines --- can you attach the sample file? (In any case, either with
DOS or UNIX newlines I don't see different behavior between Debian's
perl in wheezy and jessie)

Rafal, can you please provide the sample file as an email attachment? We will need this for further diagnosis.

Thank you very much.

--
James E Keenan (jkeenan@cpan.org)

p5pRT commented Dec 25, 2016

From @jkeenan

On Fri, 02 Dec 2016 21:55:42 GMT, jkeenan wrote:

[snip]

If there's no response from the original poster within a week, I will close this ticket.

Thank you very much.

--
James E Keenan (jkeenan@cpan.org)

p5pRT commented Jan 1, 2017

From @jkeenan

On Sun, 25 Dec 2016 02:12:24 GMT, jkeenan wrote:

[snip]

Closing as per schedule. Thank you very much.

--
James E Keenan (jkeenan@cpan.org)

p5pRT commented Jan 1, 2017

@jkeenan - Status changed from 'open' to 'rejected'
