Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser hangs on some Unicode numbers and symbols in identifiers #10275

Closed
p5pRT opened this issue Apr 3, 2010 · 25 comments
Closed

Parser hangs on some Unicode numbers and symbols in identifiers #10275

p5pRT opened this issue Apr 3, 2010 · 25 comments

Comments

@p5pRT
Copy link

p5pRT commented Apr 3, 2010

Migrated from rt.perl.org#74022 (status was 'resolved')

Searchable as RT74022$

@p5pRT
Copy link
Author

p5pRT commented Apr 3, 2010

From dgl@mogao.nonet

This is a bug report for perl from dgl@​dgl.cx
generated with the help of perlbug 1.39 running under perl 5.12.0.


If a source file or eval is in UTF-8 (i.e. use utf8 for files, or a UTF-8
string for evals) some codepoints cause the parser to hang if used unquoted.

The codepoints that cause a hang depend on the version of unicode perl is using.

Perl 5.10.0​:
PERL_SIGNALS=unsafe perl -M'charnames()' -e'$SIG{ALRM} = sub { printf "U+%X - %s\n", $_, charnames​::viacode($_); die }; for(0..0x10FFFF) { alarm 1; eval chr $_; alarm 0 }'
U+2118 - SCRIPT CAPITAL P
U+212E - ESTIMATED SYMBOL
U+309B - KATAKANA-HIRAGANA VOICED SOUND MARK
U+309C - KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

Perl 5.12.0 RC3​:
PERL_SIGNALS=unsafe perl -M'charnames()' -e'$SIG{ALRM} = sub { printf "U+%X - %s\n", $_, charnames​::viacode($_); die }; for(0..0x10FFFF) { alarm 1; eval chr $_; alarm 0 }'
U+387 - GREEK ANO TELEIA
U+1369 - ETHIOPIC DIGIT ONE
U+136A - ETHIOPIC DIGIT TWO
U+136B - ETHIOPIC DIGIT THREE
U+136C - ETHIOPIC DIGIT FOUR
U+136D - ETHIOPIC DIGIT FIVE
U+136E - ETHIOPIC DIGIT SIX
U+136F - ETHIOPIC DIGIT SEVEN
U+1370 - ETHIOPIC DIGIT EIGHT
U+1371 - ETHIOPIC DIGIT NINE
U+2118 - SCRIPT CAPITAL P
U+212E - ESTIMATED SYMBOL
U+309B - KATAKANA-HIRAGANA VOICED SOUND MARK
U+309C - KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK



Flags​:
  category=core
  severity=low


Site configuration information for perl 5.12.0​:

Configured by dgl at Sat Apr 3 10​:17​:37 BST 2010.

Summary of my perl5 (revision 5 version 12 subversion 0) configuration​:
  Commit id​: 928e1fe
  Platform​:
  osname=linux, osvers=2.6.31-10-generic, archname=x86_64-linux
  uname='linux mogao 2.6.31-10-generic #34-ubuntu smp wed sep 16 01​:09​:15 utc 2009 x86_64 gnulinux '
  config_args='-de -Dprefix=/home/dgl/perls/bleadperl'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=undef, usemultiplicity=undef
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=define, use64bitall=define, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O2',
  cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='4.4.1', gccosandvers=''
  intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
  ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries​:
  ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
  libs=-lnsl -ldl -lm -lcrypt -lutil -lc
  perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
  libc=/lib/libc-2.11.1.so, so=so, useshrplib=false, libperl=libperl.a
  gnulibc_version='2.11.1'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'

Locally applied patches​:
  RC3


@​INC for perl 5.12.0​:
  /home/dgl/perls/bleadperl/lib/site_perl/5.12.0/x86_64-linux
  /home/dgl/perls/bleadperl/lib/site_perl/5.12.0
  /home/dgl/perls/bleadperl/lib/5.12.0/x86_64-linux
  /home/dgl/perls/bleadperl/lib/5.12.0
  .


Environment for perl 5.12.0​:
  HOME=/home/dgl
  LANG=en_GB.UTF-8
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/home/dgl/perls/bleadperl/bin​:/home/dgl/bin​:/sbin​:/usr/sbin​:/usr/local/sbin​:/usr/local/bin​:/usr/local/sbin​:/usr/local/bin​:/usr/sbin​:/usr/bin​:/sbin​:/bin​:/usr/games
  PERL_BADLANG (unset)
  SHELL=/bin/zsh

@p5pRT
Copy link
Author

p5pRT commented Apr 4, 2010

From @khwilliamson

dgl@​mogao.nonet (via RT) wrote​:

# New Ticket Created by dgl@​mogao.nonet
# Please include the string​: [perl #74022]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=74022 >

This is a bug report for perl from dgl@​dgl.cx
generated with the help of perlbug 1.39 running under perl 5.12.0.

-----------------------------------------------------------------

If a source file or eval is in UTF-8 (i.e. use utf8 for files, or a UTF-8
string for evals) some codepoints cause the parser to hang if used unquoted.

The codepoints that cause a hang depend on the version of unicode perl is using.

Perl 5.10.0​:
PERL_SIGNALS=unsafe perl -M'charnames()' -e'$SIG{ALRM} = sub { printf "U+%X - %s\n", $_, charnames​::viacode($_); die }; for(0..0x10FFFF) { alarm 1; eval chr $_; alarm 0 }'
U+2118 - SCRIPT CAPITAL P
U+212E - ESTIMATED SYMBOL
U+309B - KATAKANA-HIRAGANA VOICED SOUND MARK
U+309C - KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

Perl 5.12.0 RC3​:
PERL_SIGNALS=unsafe perl -M'charnames()' -e'$SIG{ALRM} = sub { printf "U+%X - %s\n", $_, charnames​::viacode($_); die }; for(0..0x10FFFF) { alarm 1; eval chr $_; alarm 0 }'
U+387 - GREEK ANO TELEIA
U+1369 - ETHIOPIC DIGIT ONE
U+136A - ETHIOPIC DIGIT TWO
U+136B - ETHIOPIC DIGIT THREE
U+136C - ETHIOPIC DIGIT FOUR
U+136D - ETHIOPIC DIGIT FIVE
U+136E - ETHIOPIC DIGIT SIX
U+136F - ETHIOPIC DIGIT SEVEN
U+1370 - ETHIOPIC DIGIT EIGHT
U+1371 - ETHIOPIC DIGIT NINE
U+2118 - SCRIPT CAPITAL P
U+212E - ESTIMATED SYMBOL
U+309B - KATAKANA-HIRAGANA VOICED SOUND MARK
U+309C - KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

-----------------------------------------------------------------

This problem goes away if 'use utf8' is added to the program, which it
should be. I agree it shouldn't hang.

@p5pRT
Copy link
Author

p5pRT commented Apr 4, 2010

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Apr 4, 2010

From @dgl

On Sat, Apr 03, 2010 at 08​:28​:42PM -0600, karl williamson wrote​:

This problem goes away if 'use utf8' is added to the program, which it
should be. I agree it shouldn't hang.

That doesn't always seem to be the case​:

  perl -C -le'print "use utf8;\n\x{212e}"' | perl

@p5pRT
Copy link
Author

p5pRT commented Apr 5, 2010

From @khwilliamson

David Leadbeater wrote​:

On Sat, Apr 03, 2010 at 08​:28​:42PM -0600, karl williamson wrote​:

This problem goes away if 'use utf8' is added to the program, which it
should be. I agree it shouldn't hang.

That doesn't always seem to be the case​:

perl -C -le'print "use utf8;\n\x{212e}"' | perl

It is looping, and eventually my computer runs out of memory. If I
store the output of the first perl in a file, I can't get the second to
fail; similarly with the evals in the initial report.

@p5pRT
Copy link
Author

p5pRT commented Apr 26, 2010

From @cpansprout

This was broken by change 30148
(<http​://perl5.git.perl.org/perl.git/commitdiff/8158862b92>), which
introduces \p{OtherIDContinue} into the set of characters matched by
\p{ID_Continue}.

In Perl_yylex in toke.c​:

  switch (*s) {
  default​:
  if (isIDFIRST_lazy_if(s,UTF))
  goto keylookup;

isIDFIRST_lazy_if returns true for characters in ID_Continue that are
not digits. (see handy.h​:

/* The ID_Start of Unicode is quite limiting​: it assumes a L-class
  * character (meaning that you cannot have, say, a CJK character).
  * Instead, let's allow ID_Continue but not digits. */
#define isIDFIRST_utf8(p) (is_utf8_idcont(p) && !is_utf8_digit(p))

)

Then further down (in toke.c)​:

  keylookup​: {
...(8 lines snipped)...
  s = scan_word(s, PL_tokenbuf, sizeof PL_tokenbuf, FALSE, &len);

S_scan_word has​:

  else if (UTF && UTF8_IS_START(*s) && isALNUM_utf8((U8*)s)) {

So characters in \p{OtherIDContinue}, such as U+387 and U+1369, get
treated as the first char of a keyword by isIDFIRST_lazy_if, but
scan_word rejects them and does not advance, since it doesn’t use
isIDFIRST_lazy_if except after a ‘'’. So we have an infinite number of
zero-length keywords....

I think scan_word should be using is_utf8_idcont, rather than
isALNUM_utf8. The attached patch makes it do just this.

@p5pRT
Copy link
Author

p5pRT commented Apr 26, 2010

From @cpansprout

Inline Patch
diff -Nurp blead/t/comp/parser.t blead-74022/t/comp/parser.t
--- blead/t/comp/parser.t	2010-01-05 14:08:09.000000000 -0800
+++ blead-74022/t/comp/parser.t	2010-04-25 14:12:11.000000000 -0700
@@ -3,7 +3,7 @@
 # Checks if the parser behaves correctly in edge cases
 # (including weird syntax errors)
 
-print "1..122\n";
+print "1..123\n";
 
 sub failed {
     my ($got, $expected, $name) = @_;
@@ -353,6 +353,11 @@ eval q{
 };
 is($@, "", "multiline whitespace inside substitute expression");
 
+# bug #74022: Loop on characters in \p{OtherIDContinue}
+# This test hangs if it fails.
+eval chr 0x387;
+is(1,1, '[perl #74022] Parser looping on OtherIDContinue chars');
+
 # Add new tests HERE:
 
 # More awkward tests for #line. Keep these at the end, as they will screw
Binary files blead/t/perl and blead-74022/t/perl differ
Inline Patch
diff -Nurp blead/toke.c blead-74022/toke.c
--- blead/toke.c	2010-04-23 05:36:10.000000000 -0700
+++ blead-74022/toke.c	2010-04-25 14:03:13.000000000 -0700
@@ -11662,7 +11662,7 @@ S_scan_word(pTHX_ register char *s, char
 	    *d++ = *s++;
 	    *d++ = *s++;
 	}
-	else if (UTF && UTF8_IS_START(*s) && isALNUM_utf8((U8*)s)) {
+	else if (UTF && UTF8_IS_START(*s) && is_utf8_idcont((U8*)s)) {
 	    char *t = s + UTF8SKIP(s);
 	    size_t len;
 	    while (UTF8_IS_CONTINUED(*t) && is_utf8_mark((U8*)t))

@p5pRT
Copy link
Author

p5pRT commented Apr 26, 2010

From @cpansprout

On Apr 25, 2010, at 2​:21 PM, Father Chrysostomos wrote​:

I think scan_word should be using is_utf8_idcont, rather than
isALNUM_utf8. The attached patch makes it do just this.

The tests had not finished running when I sent that. lib/utf8.t is
failing. It turns out that things are not as simple as I thought.
toke.c has 23 instances of isIDFIRST_lazy_if, so it seems that most of
the code is expecting S_scan_word to match something like

  /^(?!\p{IsDigit})[\p{ID_Continue}_]+/

whereas what it actually matches (ignoring package separators) is

  /^([\p{IsWord}_]\pM?)*/

My patch prevents qq·aaa· from being valid syntax, because U+B7 is
part of \p{ID_Continue} (hence the lib/utf8.t failure). One thing my
patch didn’t address was the \pM? (is_utf8_mark) part of scan_word. \p
{ID_Continue} contains all of \pM except for the thirteen characters
in \p{Me}.

So there is a potential for breakage if we make everything match
Unicode. The macro handy.h is already explicitly looser than Unicode.
Fixing this bug requires an arbitrary decision from someone more
knowledgeable than I.

@p5pRT
Copy link
Author

p5pRT commented Apr 28, 2010

From @khwilliamson

Father Chrysostomos wrote​:

On Apr 25, 2010, at 2​:21 PM, Father Chrysostomos wrote​:

I think scan_word should be using is_utf8_idcont, rather than
isALNUM_utf8. The attached patch makes it do just this.

The tests had not finished running when I sent that. lib/utf8.t is
failing. It turns out that things are not as simple as I thought. toke.c
has 23 instances of isIDFIRST_lazy_if, so it seems that most of the code
is expecting S_scan_word to match something like

/^(?!\p{IsDigit})[\p{ID_Continue}_]+/

whereas what it actually matches (ignoring package separators) is

/^([\p{IsWord}_]\pM?)*/

My patch prevents qq·aaa· from being valid syntax, because U+B7 is part
of \p{ID_Continue} (hence the lib/utf8.t failure). One thing my patch
didn’t address was the \pM? (is_utf8_mark) part of scan_word.
\p{ID_Continue} contains all of \pM except for the thirteen characters
in \p{Me}.

So there is a potential for breakage if we make everything match
Unicode. The macro handy.h is already explicitly looser than Unicode.
Fixing this bug requires an arbitrary decision from someone more
knowledgeable than I.

Thanks for finding this. I've wondered about the comment in handy.h
that you quoted that documents that we decided to use a Perl home-grown
version of this rather than the official Unicode one. To repeat, it is

/* The ID_Start of Unicode is quite limiting​: it assumes a L-class
  * character (meaning that you cannot have, say, a CJK character).
  * Instead, let's allow ID_Continue but not digits. */

Jarkko wrote that comment in 2002. Since then (actually quite a long
time ago), Unicode has fixed this problem, and the official ID_Start
does include Han characters and Korean syllables.

Jarkko wrote me last year that "Unicode knows best". In other words,
they will, in general, do a better job than people at Perl could
possibly do at figuring out what's best. They haven't always, but it's
getting better as the Standard has evolved, and is stabilizing over
time. They've put more and more things into place to minimize errors,
but that's not to say those have gone to zero.

In 5.12, I took Jarkko's advice, and changed our definitions of \p
properties to be identical to Unicode's. And people agreed with that
decision, so that's what got shipped.

I had been planning to look at this area too, and your posts spurred me
to do it. What I think is that we should move to Unicode's definitions,
even if it means breaking some existing code. Going forward, then, we
won't have to worry about it, as those definitions get added to (and
perhaps modified); we just follow the Standard.

The middle dot that caused your test to fail is one that Unicode has had
some issues with knowing how to handle. I haven't checked if it has
changed with regard to this property, but it has in others. But that
has settled down, and remained unchanged in recent releases.

Actually, I think we should move not to ID_Start, but to Unicode's
revised property, XID_Start which they recommend over the earlier one,
and is nearly identical, but better handles a few weirdly behaving
characters, in Thai, Lao, Greek, and Arabic mostly. 5.12's regex \X
construct uses a similar Unicode definition that takes these into
account, and it automatically fixes the issue with marks that you
pointed out.

Unicode is keeping ID_Start around for backwards compatibility. I don't
know if they intend to do so indefinitely or not. My guess is that it
will be there for quite some time to come.

To summarize, I propose that we use Unicode's XID_Start and XID_Continue
properties in 5.14, even though that breaks one of our tests, and
possibly existing code.

@p5pRT
Copy link
Author

p5pRT commented May 2, 2010

From @khwilliamson

Since this causes Perl to hang, I think it should be addressed somehow
in 5.12.1. It may be that the thing to do is just document it. It's
been around since 2007. I'm still looking at how things are done
currently, and a number of things appear wrong to me, but that's an
initial take, subject to further consideration.

Father Chrysostomos wrote​:

On Apr 27, 2010, at 6​:56 PM, karl williamson wrote​:

To summarize, I propose that we use Unicode's XID_Start and
XID_Continue properties in 5.14, even though that breaks one of our
tests, and possibly existing code.

Would we change the meanings of is_utf8_idcont and is_utf8_idfirst, or
introduce new functions?

My first take is that I think we would just change the meanings. The
differences are quite minimal. ID_Start contains 23 more characters
than XID_Start​:
037A GREEK YPOGEGRAMMENI
0E33 THAI CHARACTER SARA AM
0EB3 LAO VOWEL SIGN AM
309B KATAKANA-HIRAGANA VOICED SOUND MARK
309C KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
FC5E ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
FC5F ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
FC60 ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
FC61 ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
FC62 ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
FC63 ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
FDFB ARABIC LIGATURE JALLAJALALOUHOU
FE70 ARABIC FATHATAN ISOLATED FORM
FE72 ARABIC DAMMATAN ISOLATED FORM
FE74 ARABIC KASRATAN ISOLATED FORM
FE76 ARABIC FATHA ISOLATED FORM
FE78 ARABIC DAMMA ISOLATED FORM
FE7A ARABIC KASRA ISOLATED FORM
FE7C ARABIC SHADDA ISOLATED FORM
FE7E ARABIC SUKUN ISOLATED FORM
FF9E HALFWIDTH KATAKANA VOICED SOUND MARK
FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

And ID_Continue contains 19 more characters than XID_Continue​:
037A GREEK YPOGEGRAMMENI
309B KATAKANA-HIRAGANA VOICED SOUND MARK
309C KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
FC5E ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
FC5F ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
FC60 ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
FC61 ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
FC62 ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
FC63 ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
FDFB ARABIC LIGATURE JALLAJALALOUHOU
FE70 ARABIC FATHATAN ISOLATED FORM
FE72 ARABIC DAMMATAN ISOLATED FORM
FE74 ARABIC KASRATAN ISOLATED FORM
FE76 ARABIC FATHA ISOLATED FORM
FE78 ARABIC DAMMA ISOLATED FORM
FE7A ARABIC KASRA ISOLATED FORM
FE7C ARABIC SHADDA ISOLATED FORM
FE7E ARABIC SUKUN ISOLATED FORM

So the differences are minimal; we would be recognizing 23 or 19 fewer
characters by going with the X versions. You can tell from some of the
names why it was wrong to put them in the original versions.

But I need to further study things to come up with a recommendation

In anticipation of this change, I’ve attached a patch that corrects the
test in utf8.t to use ¡ instead of ·. I’ve also moved the test outside
of the eval, so it will still run (and fail) if the compilation fails,
instead of causing an invalid test count.

Thanks. Have you considered adding a timeout? test.pl has one that will
kill the test script if Perl hangs.

@p5pRT
Copy link
Author

p5pRT commented May 3, 2010

From @cpansprout

On Apr 27, 2010, at 6​:56 PM, karl williamson wrote​:

To summarize, I propose that we use Unicode's XID_Start and
XID_Continue properties in 5.14, even though that breaks one of our
tests, and possibly existing code.

Would we change the meanings of is_utf8_idcont and is_utf8_idfirst, or
introduce new functions?

In anticipation of this change, I’ve attached a patch that corrects
the test in utf8.t to use ¡ instead of ·. I’ve also moved the test
outside of the eval, so it will still run (and fail) if the
compilation fails, instead of causing an invalid test count.

@p5pRT
Copy link
Author

p5pRT commented May 3, 2010

From @cpansprout

Inline Patch
diff -Nurp blead/lib/utf8.t blead-74022/lib/utf8.t
--- blead/lib/utf8.t	2009-11-19 08:51:39.000000000 -0800
+++ blead-74022/lib/utf8.t	2010-04-30 18:11:43.000000000 -0700
@@ -329,8 +329,9 @@ END
 SKIP: {
     skip("Embedded UTF-8 does not work in EBCDIC", 1) if ord("A") == 193;
     use utf8;
-    eval qq{is(q \xc3\xbc test \xc3\xbc, qq\xc2\xb7 test \xc2\xb7,
-	       "utf8 quote delimiters [perl #16823]");};
+    is eval qq{q \xc3\xbc test \xc3\xbc . qq\xc2\xa1 test \xc2\xa1},
+      ' test  test ',
+      "utf8 quote delimiters [perl #16823]";
 }
 
 # Test the "internals".

@p5pRT
Copy link
Author

p5pRT commented May 3, 2010

From @khwilliamson

Father Chrysostomos wrote​:

On May 2, 2010, at 3​:39 PM, karl williamson wrote​:

Since this causes Perl to hang, I think it should be addressed somehow
in 5.12.1. It may be that the thing to do is just document it.

Perhaps die in scan_word if the identifier has no length?

That sounds good to me.

@p5pRT
Copy link
Author

p5pRT commented May 3, 2010

From @cpansprout

On May 2, 2010, at 3​:39 PM, karl williamson wrote​:

Since this causes Perl to hang, I think it should be addressed
somehow in 5.12.1. It may be that the thing to do is just document
it.

Perhaps die in scan_word if the identifier has no length?

It's been around since 2007. I'm still looking at how things are
done currently, and a number of things appear wrong to me, but
that's an initial take, subject to further consideration.

Father Chrysostomos wrote​:

On Apr 27, 2010, at 6​:56 PM, karl williamson wrote​:

To summarize, I propose that we use Unicode's XID_Start and
XID_Continue properties in 5.14, even though that breaks one of
our tests, and possibly existing code.
Would we change the meanings of is_utf8_idcont and is_utf8_idfirst,
or introduce new functions?

But I need to further study things to come up with a recommendation

In anticipation of what I think your recommendation will be, I’m
sending a patch that uses the XID- variants.

In anticipation of this change, I’ve attached a patch that corrects
the test in utf8.t to use ¡ instead of ·. I’ve also moved the test
outside of the eval, so it will still run (and fail) if the
compilation fails, instead of causing an invalid test count.
Thanks. Have you considered adding a timeout? test.pl has one that
will kill the test script if Perl hangs.

That patch (open_jE1FFxzb.txt) was to fix the existing ‘qq· test ·’
test. That one doesn’t hang.

This new patch adds a test to t/comp/parser.t, which does not use test.pl
. Should it be changed so we can add a watchdog(60), or should the
test go elsewhere?

The open_jE1FFxzb.txt patch should be applied first, to keep tests
passing.

@p5pRT
Copy link
Author

p5pRT commented May 3, 2010

From @cpansprout

Inline Patch
diff -Nurp blead/handy.h blead-74022/handy.h
--- blead/handy.h	2010-04-26 02:12:12.000000000 -0700
+++ blead-74022/handy.h	2010-05-02 18:07:59.000000000 -0700
@@ -626,10 +626,9 @@ parameter, casts can silently truncate a
 #define isBLANK_LC_uni(c)	isBLANK(c) /* could be wrong */
 
 #define isALNUM_utf8(p)		is_utf8_alnum(p)
-/* The ID_Start of Unicode is quite limiting: it assumes a L-class
- * character (meaning that you cannot have, say, a CJK character).
- * Instead, let's allow ID_Continue but not digits. */
-#define isIDFIRST_utf8(p)	(is_utf8_idcont(p) && !is_utf8_digit(p))
+/* The isIDFIRST_utf8 macro has changed in the past. See
+   http://rt.perl.org/rt3/Public/Bug/Display.html?id=74022 for details. */
+#define isIDFIRST_utf8(p)	is_utf8_idfirst(p)
 #define isALPHA_utf8(p)		is_utf8_alpha(p)
 #define isSPACE_utf8(p)		is_utf8_space(p)
 #define isDIGIT_utf8(p)		is_utf8_digit(p)
diff -Nurp blead/t/comp/parser.t blead-74022/t/comp/parser.t
--- blead/t/comp/parser.t	2010-01-05 14:08:09.000000000 -0800
+++ blead-74022/t/comp/parser.t	2010-05-02 18:16:12.000000000 -0700
@@ -3,7 +3,7 @@
 # Checks if the parser behaves correctly in edge cases
 # (including weird syntax errors)
 
-print "1..122\n";
+print "1..123\n";
 
 sub failed {
     my ($got, $expected, $name) = @_;
@@ -353,6 +353,11 @@ eval q{
 };
 is($@, "", "multiline whitespace inside substitute expression");
 
+# bug #74022: Loop on characters in \p{OtherIDContinue}
+# This test hangs if it fails.
+eval chr 0x387;
+is(1,1, '[perl #74022] Parser looping on OtherIDContinue chars');
+
 # Add new tests HERE:
 
 # More awkward tests for #line. Keep these at the end, as they will screw
diff -Nurp blead/toke.c blead-74022/toke.c
--- blead/toke.c	2010-04-26 02:44:11.000000000 -0700
+++ blead-74022/toke.c	2010-05-02 18:08:59.000000000 -0700
@@ -11654,11 +11654,9 @@ S_scan_word(pTHX_ register char *s, char
 	    *d++ = *s++;
 	    *d++ = *s++;
 	}
-	else if (UTF && UTF8_IS_START(*s) && isALNUM_utf8((U8*)s)) {
+	else if (UTF && UTF8_IS_START(*s) && is_utf8_idcont((U8*)s)) {
 	    char *t = s + UTF8SKIP(s);
 	    size_t len;
-	    while (UTF8_IS_CONTINUED(*t) && is_utf8_mark((U8*)t))
-		t += UTF8SKIP(t);
 	    len = t - s;
 	    if (d + len > e)
 		Perl_croak(aTHX_ ident_too_long);
diff -Nurp blead/utf8.c blead-74022/utf8.c
--- blead/utf8.c	2010-04-15 01:33:10.000000000 -0700
+++ blead-74022/utf8.c	2010-05-02 18:07:13.000000000 -0700
@@ -1327,7 +1327,7 @@ Perl_is_utf8_idfirst(pTHX_ const U8 *p) 
     if (*p == '_')
 	return TRUE;
     /* is_utf8_idstart would be more logical. */
-    return is_utf8_common(p, &PL_utf8_idstart, "IdStart");
+    return is_utf8_common(p, &PL_utf8_idstart, "XIdStart");
 }
 
 bool
@@ -1339,7 +1339,7 @@ Perl_is_utf8_idcont(pTHX_ const U8 *p)
 
     if (*p == '_')
 	return TRUE;
-    return is_utf8_common(p, &PL_utf8_idcont, "IdContinue");
+    return is_utf8_common(p, &PL_utf8_idcont, "XIdContinue");
 }
 
 bool

@p5pRT
Copy link
Author

p5pRT commented Sep 24, 2010

From @cpansprout

On Sun May 02 17​:46​:08 2010, sprout wrote​:

On Apr 27, 2010, at 6​:56 PM, karl williamson wrote​:

To summarize, I propose that we use Unicode's XID_Start and
XID_Continue properties in 5.14, even though that breaks one of our
tests, and possibly existing code.

Would we change the meanings of is_utf8_idcont and is_utf8_idfirst, or
introduce new functions?

In anticipation of this change, I’ve attached a patch that corrects
the test in utf8.t to use ¡ instead of ·. I’ve also moved the test
outside of the eval, so it will still run (and fail) if the
compilation fails, instead of causing an invalid test count.

I’ve applied this test patch (but not the fix for the original bug
reported) as 7b30141.

@p5pRT
Copy link
Author

p5pRT commented Sep 24, 2010

From [Unknown Contact. See original ticket]

On Sun May 02 17​:46​:08 2010, sprout wrote​:

On Apr 27, 2010, at 6​:56 PM, karl williamson wrote​:

To summarize, I propose that we use Unicode's XID_Start and
XID_Continue properties in 5.14, even though that breaks one of our
tests, and possibly existing code.

Would we change the meanings of is_utf8_idcont and is_utf8_idfirst, or
introduce new functions?

In anticipation of this change, I’ve attached a patch that corrects
the test in utf8.t to use ¡ instead of ·. I’ve also moved the test
outside of the eval, so it will still run (and fail) if the
compilation fails, instead of causing an invalid test count.

I’ve applied this test patch (but not the fix for the original bug
reported) as 7b30141.

@p5pRT
Copy link
Author

p5pRT commented Sep 24, 2010

From @khwilliamson

Father Chrysostomos via RT wrote​:

On Sun May 02 17​:46​:08 2010, sprout wrote​:

On Apr 27, 2010, at 6​:56 PM, karl williamson wrote​:

To summarize, I propose that we use Unicode's XID_Start and
XID_Continue properties in 5.14, even though that breaks one of our
tests, and possibly existing code.
Would we change the meanings of is_utf8_idcont and is_utf8_idfirst, or
introduce new functions?

In anticipation of this change, I’ve attached a patch that corrects
the test in utf8.t to use ¡ instead of ·. I’ve also moved the test
outside of the eval, so it will still run (and fail) if the
compilation fails, instead of causing an invalid test count.

I’ve applied this test patch (but not the fix for the original bug
reported) as 7b30141.

I had thought some about this and concluded that the only way to
guarantee things without breaking any backward compatibility is to make
the continuation consistent with our own screwed up definition of first.
  But I haven't gone further on it. We can also change the tokenizer to
not hang but die instead if caught in a loop, but that's not the root cause.

@p5pRT
Copy link
Author

p5pRT commented Nov 14, 2010

From @cpansprout

On Fri Sep 24 13​:01​:54 2010, public@​khwilliamson.com wrote​:

I had thought some about this and concluded that the only way to
guarantee things without breaking any backward compatibility is to make
the continuation consistent with our own screwed up definition of first.
But I haven't gone further on it. We can also change the tokenizer to
not hang but die instead if caught in a loop, but that's not the root
cause.

I have thought about it a bit now. The problem with allowing Unicode
alphanumerics in identifiers is that Perl’s syntax changes subtly over
time. (So I do not think the ‘Unicode knows best’ rule, though
appropriate for \p, can apply to parsing.) This would not be a problem
if we only ever added to the allowed characters, but non-identifier
characters are allowed as delimiters. This means that we subtract from
the latter whenever we add to the former. We obviously cannot undo this
now, but we can avoid changing it more than necessary. To that end, I
have fixed the looping with commit d742518, but excluding from
isIDFIRST_utf8 the characters that were looping. No other characters
have been affected, so this is 100% backward-compatible (until the next
Unicode upgrade :-( ).

To make sure we don’t change q·foo· unknowingly, I’ve changed the test
back with commit ab10b0e.

(As a side note, ECMAScript allows Unicode characters in identifiers,
but that set only ever grows. Other non-ASCII characters are forbidden
outside of literals and comments, so the shrinking set problem does not
occur. CSS allows any non-ASCII characters in identifiers.)

@p5pRT
Copy link
Author

p5pRT commented Nov 14, 2010

From [Unknown Contact. See original ticket]

On Fri Sep 24 13​:01​:54 2010, public@​khwilliamson.com wrote​:

I had thought some about this and concluded that the only way to
guarantee things without breaking any backward compatibility is to make
the continuation consistent with our own screwed up definition of first.
But I haven't gone further on it. We can also change the tokenizer to
not hang but die instead if caught in a loop, but that's not the root
cause.

I have thought about it a bit now. The problem with allowing Unicode
alphanumerics in identifiers is that Perl’s syntax changes subtly over
time. (So I do not think the ‘Unicode knows best’ rule, though
appropriate for \p, can apply to parsing.) This would not be a problem
if we only ever added to the allowed characters, but non-identifier
characters are allowed as delimiters. This means that we subtract from
the latter whenever we add to the former. We obviously cannot undo this
now, but we can avoid changing it more than necessary. To that end, I
have fixed the looping with commit d742518, but excluding from
isIDFIRST_utf8 the characters that were looping. No other characters
have been affected, so this is 100% backward-compatible (until the next
Unicode upgrade :-( ).

To make sure we don’t change q·foo· unknowingly, I’ve changed the test
back with commit ab10b0e.

(As a side note, ECMAScript allows Unicode characters in identifiers,
but that set only ever grows. Other non-ASCII characters are forbidden
outside of literals and comments, so the shrinking set problem does not
occur. CSS allows any non-ASCII characters in identifiers.)

@p5pRT
Copy link
Author

p5pRT commented Nov 14, 2010

@cpansprout - Status changed from 'open' to 'resolved'

@p5pRT p5pRT closed this as completed Nov 14, 2010
@p5pRT
Copy link
Author

p5pRT commented Nov 14, 2010

From tchrist@perl.com

I have thought about it a bit now. The problem with allowing
Unicode alphanumerics in identifiers is that Perl's syntax
changes subtly over time.

As someone who has within the last 24 hours spent a bit of time
in the Perl debugger with Unicode alphanumeric identifiers, I
can tell you it does *not* work very well.

This command​:

  % echo "ácütê" | perl -CS -d -S leo

Yields this kinda of garbage​:

  Loading DB routines from perl5db.pl version 1.33
  Editor support available.

  Enter h or `h h' for help, or `man perldebug' for more help.

  main​::(/Users/tomchristiansen/scripts/leo​:38)​:
  38​: main();
  DB<1> b flip_diacriticals
  DB<2> c
  main​::flip_diacriticals(/Users/tomchristiansen/scripts/leo​:135)​:
  135​: binmode(DATA, "​:utf8");
  DB<2> T
  Wide character in print at /usr/local/lib/perl5/5.12.2/perl5db.pl line 6789, <> line 1.
  Wide character in print at /usr/local/lib/perl5/5.12.2/perl5db.pl line 5724, <> line 1.
  at /usr/local/lib/perl5/5.12.2/perl5db.pl line 5724
  DB​::print_trace('GLOB(0x253060)', 1) called at /usr/local/lib/perl5/5.12.2/perl5db.pl line 2842
  DB​::DB called at /Users/tomchristiansen/scripts/leo line 135
  main​::flip_diacriticals('êtücá') called at /Users/tomchristiansen/scripts/leo line 121
  main​::reverse_mark_flip('ácütê') called at /Users/tomchristiansen/scripts/leo line 57
  main​::uÊ�opÉ�pá´�ƨdn('ácütê') called at /Users/tomchristiansen/scripts/leo line 47
  main​::main() called at /Users/tomchristiansen/scripts/leo line 38
  $ = main​::flip_diacriticals('êtücá') called from file `/Users/tomchristiansen/scripts/leo' line 121
  $ = main​::reverse_mark_flip('M-acM-|tM-j') called from file `/Users/tomchristiansen/scripts/leo' line 57
  $ = main​::uʍopəpᴉƨdn('M-acM-|tM-j') called from file `/Users/tomchristiansen/scripts/leo' line 47
  . = main​::main() called from file `/Users/tomchristiansen/scripts/leo' line 38

I've used -CS on the command line, and I've even used it
before -d. I have PERL_UNICODE=SA in my shell. My program
starts like this​:

  use 5.010_000;

  use utf8;
  use strict;
  use autodie;
  use warnings qw[ FATAL all ];
  use open qw[ :std :utf8 ];

I can't think of anything else to do.

Oh wait. Yes, I can!

  % echo "ácütê" | perl -CS -d -S leo
  Loading DB routines from perl5db.pl version 1.33
  Editor support available.

  Enter h or `h h' for help, or `man perldebug' for more help.

  main​::(/Users/tomchristiansen/scripts/leo​:38)​:
  38​: main();
  DB<1> binmode(DB​::OUT, "​:utf8") || die
  DB<2> b flip_diacriticals
  DB<3> c
  main​::flip_diacriticals(/Users/tomchristiansen/scripts/leo​:135)​:
  135​: binmode(DATA, "​:utf8");
  DB<3> T
  $ = main​::flip_diacriticals('êtücá') called from file `/Users/tomchristiansen/scripts/leo' line 121
  $ = main​::reverse_mark_flip('M-acM-|tM-j') called from file `/Users/tomchristiansen/scripts/leo' line 57
  $ = main​::uÊ�opÉ�pá´�ƨdn('M-acM-|tM-j') called from file `/Users/tomchristiansen/scripts/leo' line 47
  . = main​::main() called from file `/Users/tomchristiansen/scripts/leo' line 38

See, it's still garbage!

What am I supposed to do?

And watch this​:

  DB<3> b main​::uÊ�opÉ�pá´�ƨdn
  Subroutine main​::u not found.

That was entered as

  b main​::<TAB>

and it completed to that CRAP. Heck, even when I type

  b main​::uʍopəpᴉƨdn

it ignores me and displays

  b main​::uÊ�opÉ�pá´�ƨdn

and then again bitches about

  Subroutine main​::u not found.

To add injury to insult, that's illegal UTF-8 up there in its output!
What am I supposed to do about *that*?

It's just totally bollocksed, is what it is. :(

--tom

@p5pRT
Copy link
Author

p5pRT commented Nov 14, 2010

From tchrist@perl.com

As someone who has within the last 24 hours spent a bit of time
in the Perl debugger with Unicode alphanumeric identifiers, I
can tell you it does *not* work very well.

Let me be more clear. Perl works perfectly well. So does
my program. It's just the Perl debugger that's broken.

--tom

@p5pRT
Copy link
Author

p5pRT commented Nov 14, 2010

From @cpansprout

On Sun Nov 14 14​:32​:33 2010, tom christiansen wrote​:

As someone who has within the last 24 hours spent a bit of time
in the Perl debugger with Unicode alphanumeric identifiers, I
can tell you it does *not* work very well.

Let me be more clear. Perl works perfectly well. So does
my program. It's just the Perl debugger that's broken.

Could you send your previous message to perlbug@​perl.org without the
[perl #...] in the subject?

@p5pRT
Copy link
Author

p5pRT commented Nov 14, 2010

From [Unknown Contact. See original ticket]

On Sun Nov 14 14​:32​:33 2010, tom christiansen wrote​:

As someone who has within the last 24 hours spent a bit of time
in the Perl debugger with Unicode alphanumeric identifiers, I
can tell you it does *not* work very well.

Let me be more clear. Perl works perfectly well. So does
my program. It's just the Perl debugger that's broken.

Could you send your previous message to perlbug@​perl.org without the
[perl #...] in the subject?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant