Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tr''' does not work properly #15853

Closed
p5pRT opened this issue Jan 31, 2017 · 7 comments
Closed

tr''' does not work properly #15853

p5pRT opened this issue Jan 31, 2017 · 7 comments

Comments

@p5pRT
Copy link

p5pRT commented Jan 31, 2017

Migrated from rt.perl.org#130679 (status was 'rejected')

Searchable as RT130679$

@p5pRT
Copy link
Author

p5pRT commented Jan 31, 2017

From @khwilliamson

This is a bug report for perl from khw@​cpan.org,
generated with the help of perlbug 1.40 running under perl 5.25.7.


Using a single quote as a delimiter in tr or y does not work properly.

The parsing for these commands is split between toke.c and op.c, with
the portion in toke.c mostly in S_scan_const. This function does not
get called when the delimiter is a single quote; presumably it is
expected to be only a double-quotish thing to do. scan_const creates an
intermediate form that op.c uses to finish things up. But in this case,
there is no intermediate form, and op.c is getting the raw input. This
is part of the underlying bug in
#130675. op.c is expecting the intermediate form; and the first part of
its input matches the intermediate form, right down to the single
reserved byte used by that form (which the fuzzer has managed to find).
But that reserved byte signals that there should be more to come, and
this defective input has none.

It so happens that the intermediate form is unchanged from the input for
ASCII literal characters that don't have ranges, so this works

  tr'abc'def'

but ranges don't work

  tr'a-z'A-Z' # translates only the three chars 'a', '-', 'z'

I haven't investigated \t, etc.

My guess is that it has been broken from the beginning, and it's amazing
to me that it has gone undiscovered for so long, requiring a fuzzer,
with the red herring of UTF-8 being involved.

One possibility to fix this is to simply prohibit single-quotish
behavior with tr, by putting in a check somewhere in the parsing. It
clearly isn't something that is affecting many, if any, people.

Otherwise, the portion of scan_const that deals with this would have to
be pulled out into a separate function, called as well from whatever
part of toke deals with single quotish. I believe that

  tr '\xDF'\xFE'

should not be evaluated as double-quotish, for example.

Since I don't have much knowledge of the parser's overall operation,
suggestions are welcome. Or volunteers.



Flags​:
  category=core
  severity=low


This perlbug was built using Perl 5.25.9 - Thu Jan 19 08​:46​:21 MST 2017
It is being executed now by Perl 5.25.7 - Fri Nov 18 08​:29​:04 MST 2016.

Site configuration information for perl 5.25.7​:

Configured by khw at Fri Nov 18 08​:29​:04 MST 2016.

Summary of my perl5 (revision 5 version 25 subversion 7) configuration​:
  Commit id​: 51d89e3
  Platform​:
  osname=linux
  osvers=4.8.0-27-generic
  archname=x86_64-linux-thread-multi-ld
  uname='linux khw-inspiron-3558 4.8.0-27-generic #29-ubuntu smp thu
oct 20 21​:03​:13 utc 2016 x86_64 x86_64 x86_64 gnulinux '
  config_args='-des -Uversiononly -Dprefix=/home/khw/blead -Dusedevel
-D'optimize=-ggdb3' -A'optimize=-ggdb3' -A'optimize=-O0'
-Accflags='-DPERL_BOOL_AS_CHAR' -Accflags='-DPERL_EXTERNAL_GLOB'
-Dman1dir=none -Dman3dir=none -DDEBUGGING -Dcc=g++ -Dusemorebits
-Dusethreads -Accflags='-DNO_MATHOMS''
  hint=recommended
  useposix=true
  d_sigaction=define
  useithreads=define
  usemultiplicity=define
  use64bitint=define
  use64bitall=define
  uselongdouble=define
  usemymalloc=n
  bincompat5005=undef
  Compiler​:
  cc='g++'
  ccflags ='-D_REENTRANT -D_GNU_SOURCE -DPERL_BOOL_AS_CHAR
-DPERL_EXTERNAL_GLOB -DNO_MATHOMS -fwrapv -DDEBUGGING
-fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
  optimize=' -ggdb3 -O0'
  cppflags='-D_REENTRANT -D_GNU_SOURCE -DPERL_BOOL_AS_CHAR
-DPERL_EXTERNAL_GLOB -DNO_MATHOMS -fwrapv -DDEBUGGING
-fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
  ccversion=''
  gccversion='6.2.0 20161005'
  gccosandvers=''
  intsize=4
  longsize=8
  ptrsize=8
  doublesize=8
  byteorder=12345678
  doublekind=3
  d_longlong=define
  longlongsize=8
  d_longdbl=define
  longdblsize=16
  longdblkind=3
  ivtype='long'
  ivsize=8
  nvtype='long double'
  nvsize=16
  Off_t='off_t'
  lseeksize=8
  alignbytes=16
  prototype=define
  Linker and Libraries​:
  ld='g++'
  ldflags =' -fstack-protector-strong -L/usr/local/lib'
  libpth=/usr/include/c++/6 /usr/include/x86_64-linux-gnu/c++/6
/usr/include/c++/6/backward /usr/local/lib
/usr/lib/gcc/x86_64-linux-gnu/6/include-fixed
/usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib
/usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
  libs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
  perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
  libc=libc-2.24.so
  so=so
  useshrplib=false
  libperl=libperl.a
  gnulibc_version='2.24'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs
  dlext=so
  d_dlsymun=undef
  ccdlflags='-Wl,-E'
  cccdlflags='-fPIC'
  lddlflags='-shared -ggdb3 -ggdb3 -O0 -L/usr/local/lib
-fstack-protector-strong'


@​INC for perl 5.25.7​:
/home/khw/blead/lib/perl5/site_perl/5.25.7/x86_64-linux-thread-multi-ld
  /home/khw/blead/lib/perl5/site_perl/5.25.7
/home/khw/blead/lib/perl5/5.25.7/x86_64-linux-thread-multi-ld
  /home/khw/blead/lib/perl5/5.25.7
  /home/khw/blead/lib/perl5/site_perl


Environment for perl 5.25.7​:
  HOME=/home/khw
  LANG=en_US.UTF-8
  LANGUAGE=en_US
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
PATH=/usr/lib/ccache​:/home/khw/bin​:/bin​:/usr/local/sbin​:/usr/local/bin​:/usr/sbin​:/usr/bin​:/sbin​:/usr/games​:/usr/local/games
  PERL5OPT=-w
  PERL_BADLANG (unset)
  PERL_DIFF_TOOL=wgdiff
  PERL_POD_PEDANTIC=1
  SHELL=/bin/ksh

@p5pRT
Copy link
Author

p5pRT commented Jan 31, 2017

From @tonycoz

On Mon, 30 Jan 2017 18​:24​:38 -0800, khw wrote​:

It so happens that the intermediate form is unchanged from the input
for
ASCII literal characters that don't have ranges, so this works

tr'abc'def'

but ranges don't work

tr'a-z'A-Z' # translates only the three chars 'a', '-', 'z'

I haven't investigated \t, etc.

My guess is that it has been broken from the beginning, and it's
amazing
to me that it has gone undiscovered for so long, requiring a fuzzer,
with the red herring of UTF-8 being involved.

One possibility to fix this is to simply prohibit single-quotish
behavior with tr, by putting in a check somewhere in the parsing. It
clearly isn't something that is affecting many, if any, people.

Otherwise, the portion of scan_const that deals with this would have
to
be pulled out into a separate function, called as well from whatever
part of toke deals with single quotish. I believe that

tr '\xDF'\xFE'

should not be evaluated as double-quotish, for example.

Since I don't have much knowledge of the parser's overall operation,
suggestions are welcome. Or volunteers.

We'd need to modify S_tokeq() to handle ranges in single-quotish tr'''.

Possibly it could check for broken unicode to avoid #130675.

I don't think tr''' should support escapes beyond \'

Tony

@p5pRT
Copy link
Author

p5pRT commented Jan 31, 2017

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Feb 7, 2017

From @khwilliamson

On Mon, 30 Jan 2017 20​:04​:28 -0800, tonyc wrote​:

On Mon, 30 Jan 2017 18​:24​:38 -0800, khw wrote​:

It so happens that the intermediate form is unchanged from the input
for
ASCII literal characters that don't have ranges, so this works

tr'abc'def'

but ranges don't work

tr'a-z'A-Z' # translates only the three chars 'a', '-', 'z'

I haven't investigated \t, etc.

My guess is that it has been broken from the beginning, and it's
amazing
to me that it has gone undiscovered for so long, requiring a fuzzer,
with the red herring of UTF-8 being involved.

One possibility to fix this is to simply prohibit single-quotish
behavior with tr, by putting in a check somewhere in the parsing. It
clearly isn't something that is affecting many, if any, people.

Otherwise, the portion of scan_const that deals with this would have
to
be pulled out into a separate function, called as well from whatever
part of toke deals with single quotish. I believe that

tr '\xDF'\xFE'

should not be evaluated as double-quotish, for example.

Since I don't have much knowledge of the parser's overall operation,
suggestions are welcome. Or volunteers.

We'd need to modify S_tokeq() to handle ranges in single-quotish tr'''.

Possibly it could check for broken unicode to avoid #130675.

Agreed

I don't think tr''' should support escapes beyond \'

Agreed

Tony

I don't know why the handling of tr/// is split between op.c and toke.c. I'm now thinking that most of it should be moved to op.c, with the toke.c part merely expanding the single- or double-quotish things. Then there would be no need to have an intermediate form, and I believe the code would be cleaner.
--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Feb 8, 2017

From @khwilliamson

On Mon, 06 Feb 2017 17​:05​:19 -0800, khw wrote​:

On Mon, 30 Jan 2017 20​:04​:28 -0800, tonyc wrote​:

On Mon, 30 Jan 2017 18​:24​:38 -0800, khw wrote​:

It so happens that the intermediate form is unchanged from the
input
for
ASCII literal characters that don't have ranges, so this works

tr'abc'def'

but ranges don't work

tr'a-z'A-Z' # translates only the three chars 'a', '-', 'z'

I haven't investigated \t, etc.

My guess is that it has been broken from the beginning, and it's
amazing
to me that it has gone undiscovered for so long, requiring a
fuzzer,
with the red herring of UTF-8 being involved.

One possibility to fix this is to simply prohibit single-quotish
behavior with tr, by putting in a check somewhere in the parsing.
It
clearly isn't something that is affecting many, if any, people.

Otherwise, the portion of scan_const that deals with this would
have
to
be pulled out into a separate function, called as well from
whatever
part of toke deals with single quotish. I believe that

tr '\xDF'\xFE'

should not be evaluated as double-quotish, for example.

Since I don't have much knowledge of the parser's overall
operation,
suggestions are welcome. Or volunteers.

We'd need to modify S_tokeq() to handle ranges in single-quotish
tr'''.

Possibly it could check for broken unicode to avoid #130675.

Agreed

I don't think tr''' should support escapes beyond \'

Agreed

Tony

I don't know why the handling of tr/// is split between op.c and
toke.c. I'm now thinking that most of it should be moved to op.c,
with the toke.c part merely expanding the single- or double-quotish
things. Then there would be no need to have an intermediate form,
and I believe the code would be cleaner.

I have figured out why it is split between op.c and tr.c. If op.c didn't get it before it was fully parsed, it wouldn't be able to distinguish between things like an interior hyphen was input as a literal, and hence should be treated as a metacharacter; or if it was the expansion of something like \x2D, in which case it isn't a metacharacter. We have the same problem, and have (mostly) solved it, for regex patterns that are compiled separately, but it's work, and has presented a bunch of bugs over the years
--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2018

From @khwilliamson

This is actually working as designed. I added doc clarifications in
8efc02f.

See also http​://nntp.perl.org/group/perl.perl5.porters/251544

--
Karl Williamson

@p5pRT
Copy link
Author

p5pRT commented Jul 17, 2018

@khwilliamson - Status changed from 'open' to 'rejected'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant