"Replacement list is longer than search list" is warned even if search list is in range of ASCII #14777

Closed
p5pRT opened this issue Jun 27, 2015 · 6 comments

Comments

@p5pRT

p5pRT commented Jun 27, 2015

Migrated from rt.perl.org#125493 (status was 'open')

Searchable as RT125493$

@p5pRT

p5pRT commented Jun 27, 2015

From nezumi@cpan.org

Created by nezumi@cpan.org

As of 5.21.6, the "Replacement list is longer than search list" warning seems
to be enabled for wide characters. However, it is raised even if the search list
contains only ASCII characters.

------------ >8 ------------ >8 ------------ >8 ------------
PERLIO=utf8 /usr/local/perl-5.22.0/bin/perl <<'EOF'
use warnings;
$_="3.14159";
tr/0-9/\x{6F0}-\x{6F9}/;
print $_;
EOF

Replacement list is longer than search list at - line 3.
۳.۱۴۱۵۹
------------ 8< ------------ 8< ------------ 8< ------------

If the third line of the code above does not contain "-", or if its search list
contains wide characters, the warning is not shown, e.g.:

  tr/09/\x{6F0}\x{6F9}/;
  tr/\x{6F0}-\x{6F9}/\x{116C0}-\x{116C9}/;
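
The three forms above can be compared side by side with a small harness. This is only a sketch: the helper name `tr_warnings` is made up for illustration, and it relies on the warning being emitted when the `tr///` is compiled, which is why each statement is compiled through a string `eval` with a `__WARN__` handler installed. How many warnings each form reports depends on the perl version running it.

```perl
use strict;
use warnings;

# Hypothetical helper: compile and run one tr/// against '3.14159' and
# return the transliterated string plus any warnings seen while doing so.
sub tr_warnings {
    my ($code) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    # String eval compiles the statement now, so compile-time warnings
    # reach the handler installed above.
    my $result = eval "use warnings; my \$s = '3.14159'; \$s =~ $code; \$s";
    die $@ if $@;
    return ($result, @warnings);
}

for my $code (
    'tr/0-9/\x{6F0}-\x{6F9}/',                   # ranges, ASCII search list
    'tr/09/\x{6F0}\x{6F9}/',                     # no ranges
    'tr/\x{6F0}-\x{6F9}/\x{116C0}-\x{116C9}/',   # wide characters on both sides
) {
    my ($result, @warnings) = tr_warnings($code);
    printf "%-45s warnings: %d\n", $code, scalar @warnings;
}
```

On an affected perl such as the 5.22.0 in this report, the first form would be expected to show one warning and the other two none.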

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.22.0:

Configured by xxxxxxxx at Sat Jun 27 09:00:54 JST 2015.

Summary of my perl5 (revision 5 version 22 subversion 0) configuration:
   
  Platform:
    osname=linux, osvers=2.6.32-504.23.4.el6.i686, archname=i686-linux
    uname='linux xxx.xxx.xxx 2.6.32-504.23.4.el6.i686 #1 smp tue jun 9 18:09:42 utc 2015 i686 i686 i386 gnulinux '
    config_args='-de -Dprefix=/usr/local/perl-5.22.0'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2',
    optimize='-O2',
    cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.4.7 20120313 (Red Hat 4.4.7-11)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234, doublekind=3
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12, longdblkind=3
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib /lib
    libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.12.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.12'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'



@INC for perl 5.22.0:
    /usr/local/perl-5.22.0/lib/site_perl/5.22.0/i686-linux
    /usr/local/perl-5.22.0/lib/site_perl/5.22.0
    /usr/local/perl-5.22.0/lib/5.22.0/i686-linux
    /usr/local/perl-5.22.0/lib/5.22.0
    .


Environment for perl 5.22.0:
    HOME=/home/xxxxxxxx
    LANG=ja_JP.utf8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/xxxxxxxx/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT

p5pRT commented Jul 1, 2015

From @tonycoz

The code added by 6a8b6cf, which enabled
the warning for wide characters, is buggy.

For the example supplied, rcount ends up as 55 (10+9+8...+2+1) which is
greater than the 10 on the left, so the warning triggers.
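
The arithmetic can be checked with a few lines of Perl. This is a model of the miscount only, under the assumption that a range of n characters ends up contributing n + (n-1) + ... + 1; the function name `miscounted_length` is invented here and nothing below is taken from op.c.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical model of the miscount (not the code in op.c): a range of $n
# characters contributes n + (n-1) + ... + 1 instead of n.
sub miscounted_length {
    my ($n) = @_;
    return sum(reverse 1 .. $n);
}

my $search_length = 10;                       # 0-9
my $rcount        = miscounted_length(10);    # \x{6F0}-\x{6F9}, also ten characters

printf "search list: %d, miscounted replacement list: %d\n", $search_length, $rcount;
print "warning triggers\n" if $rcount > $search_length;
```

With ten characters on each side the model gives 55 against 10, which matches the figure above.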

On the optimization side, the swash definition produced for the example is​:

0030\t\t01f0
0031\t\t01f1
0032\t\t01f2
0033\t\t01f3
0034\t\t01f4
0035\t\t01f5
0036\t\t01f6
0037\t\t01f7
0038\t\t01f8
0039\t\t01f9

which could probably be simplified.

Tony

@p5pRT

p5pRT commented Jul 1, 2015

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

p5pRT commented Nov 7, 2016

From bkb@cpan.org

On Tue, 30 Jun 2015 23​:24​:00 -0700, tonyc wrote​:

The code added by 6a8b6cf, which enabled
the warning for wide characters, is buggy.

The code added in that commit is not itself buggy; it exposes a bug that was already there. The commit adds a warning when the list lengths mismatch in UTF-8-flagged strings. Before that commit, rcount was calculated but never used, apart from a pointless rlen = rcount assignment just before the routine exits. Adding the goto to the "warnins" (sic) section revealed that the rlen value was meaningless. The actual bug here is in the calculation of rcount.

Incidentally, I ran into this bug in the following case, which might be a useful test of whether it has been fixed:

https://github.com/benkasminbullock/Lingua-JA-Moji/blob/75bc7f5e4e0923aa14bfcf4240ff7be270a2da13/lib/Lingua/JA/Moji.pm#L924

These lines​:

$input =~ tr/\x{3000}\x{FF01}-\x{FF5E}/ -~/;

and

$input =~ tr/ -~/\x{3000}\x{FF01}-\x{FF5E}/;

both trip the bug.

Adding a line like this into the source code of op.c after the increments of rcount and tcount​:

  Perl_warn ("BKB test​: %d %d\n", tcount, rcount);

gives values like this​:

BKB test​: 1 1
BKB test​: 95 2
BKB test​: 188 3
BKB test​: 280 4
BKB test​: 371 5
BKB test​: 461 6
BKB test​: 550 7
BKB test​: 638 8
BKB test​: 725 9
BKB test​: 811 10
BKB test​: 896 11
BKB test​: 980 12
BKB test​: 1063 13
BKB test​: 1145 14
BKB test​: 1226 15
BKB test​: 1306 16
BKB test​: 1385 17
BKB test​: 1463 18
BKB test​: 1540 19
BKB test​: 1616 20
BKB test​: 1691 21
BKB test​: 1765 22
BKB test​: 1838 23
BKB test​: 1910 24
BKB test​: 1981 25
BKB test​: 2051 26
BKB test​: 2120 27
BKB test​: 2188 28
BKB test​: 2255 29
BKB test​: 2321 30
BKB test​: 2386 31
BKB test​: 2450 32
BKB test​: 2513 33
BKB test​: 2575 34
BKB test​: 2636 35
BKB test​: 2696 36
BKB test​: 2755 37
BKB test​: 2813 38
BKB test​: 2870 39
BKB test​: 2926 40
BKB test​: 2981 41
BKB test​: 3035 42
BKB test​: 3088 43
BKB test​: 3140 44
BKB test​: 3191 45
BKB test​: 3241 46
BKB test​: 3290 47
BKB test​: 3338 48
BKB test​: 3385 49
BKB test​: 3431 50
BKB test​: 3476 51
BKB test​: 3520 52
BKB test​: 3563 53
BKB test​: 3605 54
BKB test​: 3646 55
BKB test​: 3686 56
BKB test​: 3725 57
BKB test​: 3763 58
BKB test​: 3800 59
BKB test​: 3836 60
BKB test​: 3871 61
BKB test​: 3905 62
BKB test​: 3938 63
BKB test​: 3970 64
BKB test​: 4001 65
BKB test​: 4031 66
BKB test​: 4060 67
BKB test​: 4088 68
BKB test​: 4115 69
BKB test​: 4141 70
BKB test​: 4166 71
BKB test​: 4190 72
BKB test​: 4213 73
BKB test​: 4235 74
BKB test​: 4256 75
BKB test​: 4276 76
BKB test​: 4295 77
BKB test​: 4313 78
BKB test​: 4330 79
BKB test​: 4346 80
BKB test​: 4361 81
BKB test​: 4375 82
BKB test​: 4388 83
BKB test​: 4400 84
BKB test​: 4411 85
BKB test​: 4421 86
BKB test​: 4430 87
BKB test​: 4438 88
BKB test​: 4445 89
BKB test​: 4451 90
BKB test​: 4456 91
BKB test​: 4460 92
BKB test​: 4463 93
BKB test​: 4465 94
BKB test​: 4466 95
BKB test​: 1 1
BKB test​: 2 95
BKB test​: 3 188
BKB test​: 4 280
BKB test​: 5 371
BKB test​: 6 461
BKB test​: 7 550
BKB test​: 8 638
BKB test​: 9 725
BKB test​: 10 811
BKB test​: 11 896
BKB test​: 12 980
BKB test​: 13 1063
BKB test​: 14 1145
BKB test​: 15 1226
BKB test​: 16 1306
BKB test​: 17 1385
BKB test​: 18 1463
BKB test​: 19 1540
BKB test​: 20 1616
BKB test​: 21 1691
BKB test​: 22 1765
BKB test​: 23 1838
BKB test​: 24 1910
BKB test​: 25 1981
BKB test​: 26 2051
BKB test​: 27 2120
BKB test​: 28 2188
BKB test​: 29 2255
BKB test​: 30 2321
BKB test​: 31 2386
BKB test​: 32 2450
BKB test​: 33 2513
BKB test​: 34 2575
BKB test​: 35 2636
BKB test​: 36 2696
BKB test​: 37 2755
BKB test​: 38 2813
BKB test​: 39 2870
BKB test​: 40 2926
BKB test​: 41 2981
BKB test​: 42 3035
BKB test​: 43 3088
BKB test​: 44 3140
BKB test​: 45 3191
BKB test​: 46 3241
BKB test​: 47 3290
BKB test​: 48 3338
BKB test​: 49 3385
BKB test​: 50 3431
BKB test​: 51 3476
BKB test​: 52 3520
BKB test​: 53 3563
BKB test​: 54 3605
BKB test​: 55 3646
BKB test​: 56 3686
BKB test​: 57 3725
BKB test​: 58 3763
BKB test​: 59 3800
BKB test​: 60 3836
BKB test​: 61 3871
BKB test​: 62 3905
BKB test​: 63 3938
BKB test​: 64 3970
BKB test​: 65 4001
BKB test​: 66 4031
BKB test​: 67 4060
BKB test​: 68 4088
BKB test​: 69 4115
BKB test​: 70 4141
BKB test​: 71 4166
BKB test​: 72 4190
BKB test​: 73 4213
BKB test​: 74 4235
BKB test​: 75 4256
BKB test​: 76 4276
BKB test​: 77 4295
BKB test​: 78 4313
BKB test​: 79 4330
BKB test​: 80 4346
BKB test​: 81 4361
BKB test​: 82 4375
BKB test​: 83 4388
BKB test​: 84 4400
BKB test​: 85 4411
BKB test​: 86 4421
BKB test​: 87 4430
BKB test​: 88 4438
BKB test​: 89 4445
BKB test​: 90 4451
BKB test​: 91 4456
BKB test​: 92 4460
BKB test​: 93 4463
BKB test​: 94 4465
BKB test​: 95 4466
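
The two runs of numbers above are consistent with one side being counted as a single character followed by a 94-character range whose remaining length is added at every step, while the other side goes up by one per step. The following sketch regenerates the series under that assumption; `debug_series` is a made-up name and the loop is a reconstruction from the printed values, not the logic in op.c.

```perl
use strict;
use warnings;

# Hypothetical reconstruction of the printed series (not the op.c logic).
# One side is a single character plus a 94-character range
# (\x{3000}\x{FF01}-\x{FF5E}); the other is the 95 characters " -~".
sub debug_series {
    my ($range_length, $steps) = @_;
    my @pairs;
    my ($inflated, $plain) = (0, 0);
    for my $step (1 .. $steps) {
        # The first step counts the single character; every later step adds
        # the remaining length of the range rather than 1.
        $inflated += $step == 1 ? 1 : $range_length - ($step - 2);
        $plain    += 1;
        push @pairs, [$inflated, $plain];
    }
    return @pairs;
}

my @pairs = debug_series(94, 95);
# Print the first four steps and the last one.
printf "BKB test: %d %d\n", @$_ for @pairs[0 .. 3, -1];
```

The final pair is 4466 and 95, that is 1 + (94 + 93 + ... + 1) against the true length of 95. The second run of output is the same series with the two columns swapped.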

Going further into this, the test *t == ILLEGAL_UTF8_BYTE, or the equivalent test on *r, is only applied to the side which contains UTF-8 and not to the other side. Putting UTF-8 on both sides of the tr/// avoids the problem.

Here is a simple test case​:

use warnings;
use utf8;
my $input = 'x';
$input =~ tr/a-c/\x{3000}-\x{3002}/;
$input =~ tr/�-�/\x{3000}-\x{3002}/;

Only the first tr/// trips the bug; the counts are correct for the second one, which has wide characters on both sides. So this bug is related to conversions that have UTF-8 on only one side.

At this point I'm out of my depth for the best way to fix this, but it seems to me that the range operator (the "-") is not being correctly converted to ILLEGAL_UTF8_BYTE when one of from_utf or to_utf is not set.


@p5pRT

p5pRT commented Feb 8, 2017

From @khwilliamson

I plan on fixing this in 5.27. In the meantime, I've added the first example as a TODO test in our suite, via commit a0c4698
--
Karl Williamson

@khwilliamson

This has finally been fixed by
commit f34acfe
Author: Karl Williamson <khw@cpan.org>
Date: Mon Nov 4 21:30:48 2019 -0700

    Reimplement tr/// without swashes
