Some parts of regex engine impose I32 limit on code points #12467

Closed
p5pRT opened this issue Oct 7, 2012 · 11 comments

p5pRT commented Oct 7, 2012

Migrated from rt.perl.org#115166 (status was 'resolved')

Searchable as RT115166$

p5pRT commented Oct 7, 2012

From @khwilliamson

This is a bug report for perl from khw@karl.(none),
generated with the help of perlbug 1.39 running under perl 5.17.5.


Some parts of the regex engine and documentation say that it can handle
anything that fits in a UV. Other parts (I'm not sure about the
documentation) restrict things to an I32.



Flags:
  category=core
  severity=medium


Site configuration information for perl 5.17.5:

Configured by khw at Sun Oct 7 09:13:22 MDT 2012.

Summary of my perl5 (revision 5 version 17 subversion 5) configuration:
  Commit id: 5c1648b
  Platform:
  osname=linux, osvers=2.6.35-32-generic-pae, archname=i686-linux-thread-multi-64int-ld
  uname='linux karl 2.6.35-32-generic-pae #67-ubuntu smp mon mar 5 21:23:19 utc 2012 i686 gnulinux '
  config_args='-des -Dprefix=/home/khw/blead -Dusedevel -D'optimize=-ggdb3' -A'optimize=-ggdb3' -A'optimize=-O0' -Dman1dir=none -Dman3dir=none -DDEBUGGING -Dcc=g++ -Dusemorebits -Dusethreads'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=define, usemultiplicity=define
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=define, use64bitall=undef, uselongdouble=define
  usemymalloc=n, bincompat5005=undef
  Compiler:
  cc='g++', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O0 -ggdb3',
  cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='4.4.5', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long long', ivsize=8, nvtype='long double', nvsize=12, Off_t='off_t', lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries:
  ld='g++', ldflags =' -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /lib/../lib /usr/lib/../lib /lib /usr/lib /usr/lib/i686-linux-gnu
  libs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
  perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
  libc=/lib/../lib/libc.so.6, so=so, useshrplib=false, libperl=libperl.a
  gnulibc_version='2.12'
  Dynamic Linking:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -ggdb3 -ggdb3 -O0 -L/usr/local/lib -fstack-protector'

Locally applied patches:


@INC for perl 5.17.5:
  /home/khw/blead/lib/perl5/site_perl/5.17.5/i686-linux-thread-multi-64int-ld
  /home/khw/blead/lib/perl5/site_perl/5.17.5
  /home/khw/blead/lib/perl5/5.17.5/i686-linux-thread-multi-64int-ld
  /home/khw/blead/lib/perl5/5.17.5
  /home/khw/blead/lib/perl5/site_perl
  .


Environment for perl 5.17.5:
  HOME=/home/khw
  LANG=en_US.UTF-8
  LANGUAGE=en_US:en
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/home/khw/cxoffice/bin
  PERL5OPT=-w
  PERL_BADLANG (unset)
  SHELL=/bin/ksh

p5pRT commented Oct 7, 2012

From @khwilliamson

I'm currently thinking that it was/is a mistake to allow anything larger
than the platform's IV_MAX. And I've been somewhat responsible for
that mistake.

It turns out that some code uses negative code points as markers, and
this seems like a reasonable thing to do.

I therefore propose that we break potential backwards compatibility by
restricting code points to IV_MAX. I'm not sure that deprecation cycles
are needed.
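
A rough C sketch of why negative markers and code points above IV_MAX conflict. This is only an illustration, not code from the Perl source: `cp_t`, `CP_END_MARKER`, `is_marker`, and `store_code_point` are hypothetical names, and `int64_t`/`uint64_t` stand in for IV/UV on a build with 64-bit integers. Once code points are carried in a signed type, a value above IV_MAX lands in the negative range and can no longer be told apart from a marker.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in: int64_t plays the role of Perl's IV. */
typedef int64_t cp_t;

/* Hypothetical sentinel: a negative "code point" used as a marker. */
#define CP_END_MARKER ((cp_t)-1)

/* Any negative value is treated as a marker, not a real code point. */
int is_marker(cp_t cp) { return cp < 0; }

/* Store an unsigned (UV-like) code point in the signed type.
 * For values above INT64_MAX the conversion is implementation-defined;
 * on the usual two's-complement compilers it yields a negative number. */
cp_t store_code_point(uint64_t cp) { return (cp_t)cp; }

void demo_marker_collision(void) {
    /* An ordinary code point is not mistaken for a marker. */
    assert(!is_marker(store_code_point(0x10FFFF)));
    assert(is_marker(CP_END_MARKER));
    /* A code point just above IV_MAX now looks like a marker. */
    assert(is_marker(store_code_point((uint64_t)INT64_MAX + 1)));
}
```

Capping code points at IV_MAX removes the overlap, which appears to be the motivation for the proposal above.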

p5pRT commented Oct 8, 2012

From @Tux

+1

--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.17 porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

p5pRT commented Oct 8, 2012

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Nov 21, 2012

From @khwilliamson

I started to implement this, and ran into two problems. One is that
there are tests in op/ver.t of v-strings that exceed IV_MAX. I suppose
we could allow an exception for that, assuming that such a string would
not be used for other purposes.

The other is that complementing a scalar which contains a UTF-8 string
results in (from perlop):

"When complementing strings, if all characters have ordinal values under
256, then their complements will, also. But if they do not, all
characters will be in either 32- or 64-bit complements, depending on
your architecture. So for example, ~"\x{3B1}" is "\x{FFFF_FC4E}" on
32-bit machines and "\x{FFFF_FFFF_FFFF_FC4E}" on 64-bit machines."

I think, if we were to go ahead and restrict to IV_MAX, that we would
have to complement based off that instead of UV_MAX, which would change
the results for existing code that does that.
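
The arithmetic behind the perlop example can be checked with a small C sketch. The function names are hypothetical, `uint32_t`/`uint64_t` stand in for a 32- or 64-bit UV, and `complement_iv64` is only one guess at what "complementing based off IV_MAX" might mean (flipping the low 63 bits); nothing here is taken from the Perl source.

```c
#include <assert.h>
#include <stdint.h>

/* Complement against the full unsigned range, matching the perlop
 * description for 32-bit and 64-bit builds respectively. */
uint32_t complement_uv32(uint32_t cp) { return ~cp; }
uint64_t complement_uv64(uint64_t cp) { return ~cp; }

/* Hypothetical alternative: complement within 0..IV_MAX by flipping
 * only the low 63 bits, so the result never exceeds INT64_MAX. */
uint64_t complement_iv64(uint64_t cp) { return cp ^ (uint64_t)INT64_MAX; }

void demo_complement(void) {
    assert(complement_uv32(0x3B1) == 0xFFFFFC4Eu);
    assert(complement_uv64(0x3B1) == 0xFFFFFFFFFFFFFC4Eull);
    /* Under the IV_MAX-based variant the top bit stays clear,
     * so existing code would see a different result. */
    assert(complement_iv64(0x3B1) == 0x7FFFFFFFFFFFFC4Eull);
}
```

The first two results match the values quoted from perlop; the third shows how the answer would differ if the complement were taken against IV_MAX.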

But if we don't restrict to IV_MAX, we have issues in that some places
in the code have always assumed an IV_MAX max, and a lot of things that
aren't even necessarily about code points do as well. For example a
'for' loop max value is an IV, so if you are looping through some high
code points, you can't do

  for (my $code_point = $high_number;
       $code_point < $above_IV_MAX;
       $code_point++) {}

Thus there is a dilemma that I don't see a good answer to.
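
As a loose C analogy for the loop problem (not a description of how perl implements loops): a bound above IV_MAX cannot be represented in a signed counter at all, while an unsigned counter walks across the boundary without trouble. `bound_fits_signed` and `count_unsigned` are hypothetical helpers, with `int64_t`/`uint64_t` standing in for IV/UV.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if a loop bound can be held in a signed 64-bit counter. */
int bound_fits_signed(uint64_t bound) {
    return bound <= (uint64_t)INT64_MAX;
}

/* Count iterations over [from, to) using an unsigned counter, which
 * can represent values on both sides of INT64_MAX. */
unsigned count_unsigned(uint64_t from, uint64_t to) {
    unsigned n = 0;
    for (uint64_t cp = from; cp < to; cp++)
        n++;
    return n;
}

void demo_loop_bounds(void) {
    uint64_t iv_max = (uint64_t)INT64_MAX;
    assert(bound_fits_signed(iv_max));
    assert(!bound_fits_signed(iv_max + 3));
    /* Four values: IV_MAX-1, IV_MAX, IV_MAX+1, IV_MAX+2. */
    assert(count_unsigned(iv_max - 1, iv_max + 3) == 4);
}
```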

p5pRT commented Jul 27, 2013

From @cpansprout

What I say may not be of any consequence, but I do see uses for code
points up to U32_MAX. As for anything larger than that, I don’t care
what you do, even on 64-bit platforms. In fact, if we restrict 64-bit
platforms to U32_MAX, we could document the limit without qualifications.

--

Father Chrysostomos

p5pRT commented Sep 11, 2015

From @khwilliamson

Since then, I've found that tr/// imposes an undocumented IV limit on
everything in it, due to the algorithm used, which could be changed.
But I've been thinking it would be good to implement Perl 6's method of
handling grapheme clusters (they call it Normalization Form Grapheme, or
NFG), which makes handling them much easier. That would be a big win.
The way they do it is to limit Unicode code points to IVs, and use
negative ones as a heap to represent the graphemes found in the current
run of the interpreter. (I suppose one could eventually run out of
these.) I haven't looked at NFG much to see if it is actually feasible.
But parts of the current implementation already restrict code points to
IV_MAX and will eventually have to be fixed to warn, rather than loop,
when confronted with larger ones, and there is the potential of a big
win if we limit everything to IV_MAX, so I think we should do it. So
I'm wondering how important the uses you see for up to U32_MAX are?
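
The idea of using negative values for grapheme clusters can be sketched in C as follows. This is a toy illustration under stated assumptions, not Perl 6's actual NFG implementation: the fixed-size table, the limits `MAX_CLUSTERS` and `MAX_CLUSTER_LEN`, and the functions `intern_cluster` and `lookup_cluster` are all hypothetical. Each distinct cluster seen during a run gets the next free negative id, and real code points stay non-negative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_CLUSTERS    64  /* hypothetical table capacity */
#define MAX_CLUSTER_LEN 8   /* hypothetical max code points per cluster */

typedef struct {
    int32_t cps[MAX_CLUSTER_LEN];  /* the code points in the cluster */
    size_t  len;
} cluster_t;

static cluster_t table[MAX_CLUSTERS];
static size_t table_used = 0;

/* Return the negative id for a cluster, registering it if it is new.
 * Returns 0 when the cluster is empty, too long, or the table is full
 * (0 is never handed out as a synthetic id in this sketch). */
int32_t intern_cluster(const int32_t *cps, size_t len) {
    if (len == 0 || len > MAX_CLUSTER_LEN)
        return 0;
    for (size_t i = 0; i < table_used; i++) {
        if (table[i].len == len &&
            memcmp(table[i].cps, cps, len * sizeof *cps) == 0)
            return -(int32_t)(i + 1);
    }
    if (table_used == MAX_CLUSTERS)
        return 0;  /* ran out of ids, as the text above anticipates */
    memcpy(table[table_used].cps, cps, len * sizeof *cps);
    table[table_used].len = len;
    table_used++;
    return -(int32_t)table_used;
}

/* Map a negative id back to its cluster; NULL if the id is unknown. */
const cluster_t *lookup_cluster(int32_t id) {
    if (id >= 0 || (size_t)(-(int64_t)id) > table_used)
        return NULL;
    return &table[-(int64_t)id - 1];
}
```

A string could then be stored as one signed integer per grapheme, with non-negative entries meaning plain code points and negative entries indexing the table. That only works if no real code point ever needs the negative half of the range.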

p5pRT commented Sep 12, 2015

From @cpansprout

It has been a while since I said that, and now I do not remember clearly. It may be that I was thinking of using strings for sequences of arbitrary 32-bit integers, just as I currently use them for 16-bit integers. That may be too vague to deserve consideration.

If you are right that chars over IV_MAX are currently not handled correctly, then it is probably OK to do as you suggest. But it may require a deprecation cycle to weed out code that puts high codepoints in strings. While such code won’t be common, it is likely to occur in tests; and breaking people’s tests without a deprecation is not nice.

p5pRT commented Nov 29, 2015

From @khwilliamson

Commit 760c7c2 deprecates code points above IV_MAX
--
Karl Williamson

p5pRT commented Aug 8, 2016

From @khwilliamson

I'm closing this, as the usage is now deprecated
--
Karl Williamson

p5pRT commented Aug 8, 2016

@khwilliamson - Status changed from 'open' to 'resolved'
