delayed interpolation of \N{...} charnames escapes in regexes in perl 5.9.x and later causes breakage - they should be resolved and then converted to \x{...} not preserved verbatim #9397

Closed
p5pRT opened this issue Jun 29, 2008 · 138 comments

Comments

@p5pRT

p5pRT commented Jun 29, 2008

Migrated from rt.perl.org#56444 (status was 'resolved')

Searchable as RT56444$

@p5pRT

p5pRT commented Jun 29, 2008

From chris@pirazzi.net

This is a bug report for perl from perlbug@​lurkertech.com,
generated with the help of perlbug 1.36 running under perl 5.10.0.


Perl 5.10 (ActiveState ActivePerl Build 1003) breaks the following
script, as compared with Perl 5.8.8 (ActiveState ActivePerl Build
822):

use utf8;
use strict;
use English qw( -no_match_vars );
use charnames ':full';
my $r1 = qr/\N{THAI CHARACTER SARA I}/;
my $s1 = "foo";
$s1 =~ /$r1+/;

The problem is that the last line fails with:

Constant(\N{THAI CHARACTER SARA I}) unknown: (possibly a missing "use
charnames...") in regex; marked by <-- HERE in m/(?-xism:\N{THAI
CHARACTER SARA I} <-- HERE )+/ at t.pl line 7.

Note that I did use 'charnames' and that the \N{} DOES work in the first regex.
The error is on line 7, the last line, where the correctly compiled
regex gets re-interpolated.

In Perl 5.8.8, this script runs without error and the regex works as expected.

I did a bunch of Google searches but could not find any mention of this.

This might be related, but it is very old:

http://groups.google.co.th/group/perl.perl5.changes/browse_thread/thread/8a1489441e6e248/835a4e9963ac2011?lnk=st&q=perl+bug+%5CN+missing+charnames#835a4e9963ac2011

A workaround is:

my $a = "\N{THAI CHARACTER SARA I}";
my $r1 = qr/$a/;
$s1 =~ /$r1+/;

However, this is quite inconvenient, as I have hundreds of regexes that
would need changing!
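
Another possible stopgap, in the spirit of the ticket title, is to resolve each name to a \x{...} escape before the pattern is compiled. The following is only a sketch: it assumes charnames::vianame() (part of the charnames pragma) returns the code point for a full name, and the helper names hex_escape_for and names_to_hex are made up for illustration.

```perl
use strict;
use warnings;
use charnames ':full';

# Hypothetical helper: turn one character name into a \x{HEX} escape.
sub hex_escape_for {
    my ($name) = @_;
    my $cp = charnames::vianame($name);
    die "Unknown character name: $name\n" unless defined $cp;
    return sprintf '\x{%X}', $cp;
}

# Hypothetical helper: rewrite every \N{NAME} in a pattern string.
sub names_to_hex {
    my ($pattern) = @_;
    $pattern =~ s/\\N\{([^}]+)\}/hex_escape_for($1)/ge;
    return $pattern;
}

# The pattern source is kept in a single-quoted string so that \N{...}
# reaches the helper unresolved.
my $source = names_to_hex('\N{THAI CHARACTER SARA I}');
my $r1     = qr/$source/;
my $s1     = "foo";
my $hit    = ($s1 =~ /$r1+/) ? 1 : 0;   # re-interpolating $r1 now compiles

print "pattern source: $source, matched: $hit\n";
```

Because the stringified qr// then contains only \x{...}, re-interpolating it no longer depends on the charnames hints being visible at that point.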

Thanks.



Flags​:
  category=core
  severity=medium


Site configuration information for perl 5.10.0​:

Configured by SYSTEM at Tue May 13 16​:52​:25 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration​:
  Platform​:
  osname=MSWin32, osvers=5.00, archname=MSWin32-x86-multi-thread
  uname=''
  config_args='undef'
  hint=recommended, useposix=true, d_sigaction=undef
  useithreads=define, usemultiplicity=define
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=undef, use64bitall=undef, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='cl', ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32
-D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE
-DPRIVLIB_LAST_IN_INC -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS
-DUSE_PERLIO -DPERL_MSVCRT_READFIX',
  optimize='-MD -Zi -DNDEBUG -O1',
  cppflags='-DWIN32'
  ccversion='14.0.50727', gccversion='', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=10
  ivtype='long', ivsize=4, nvtype='double', nvsize=8,
Off_t='__int64', lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries​:
  ld='link', ldflags ='-nologo -nodefaultlib -debug -opt​:ref,icf
-libpath​:"C​:\perl\lib\CORE" -machine​:x86'
  libpth=\lib
  libs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib
comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib
netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib
odbc32.lib odbccp32.lib msvcrt.lib
  perllibs= oldnames.lib kernel32.lib user32.lib gdi32.lib
winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib
oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib
version.lib odbc32.lib odbccp32.lib msvcrt.lib
  libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl510.lib
  gnulibc_version=''
  Dynamic Linking​:
  dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
  cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug
-opt​:ref,icf -libpath​:"C​:\perl\lib\CORE" -machine​:x86'

Locally applied patches​:
  ACTIVEPERL_LOCAL_PATCHES_ENTRY
  33741 avoids segfaults invoking S_raise_signal() (on Linux)
  33763 Win32 process ids can have more than 16 bits
  32809 Load 'loadable object' with non-default file extension
  32728 64-bit fix for Time​::Local


@​INC for perl 5.10.0​:
  c​:/perl/site/lib
  c​:/perl/lib
  .


Environment for perl 5.10.0​:
  HOME=c​:\
  LANG (unset)
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=C​:\perl\bin;C​:\tcl\bin;c​:\mysql\bin;C​:\WINDOWS\system32;C​:\WINDOWS;C​:\WINDOWS\System32\Wbem;C​:\Program
Files\Common Files\Roxio
Shared\DLLShared;C​:\s\Common7\IDE;C​:\s\VC\BIN;C​:\s\Common7\Tools;C​:\s\Common7\Tools\bin;C​:\s\VC\PlatformSDK\bin;C​:\s\SDK\v2.0\bin;C​:\WINDOWS\Microsoft.NET\Framework\v2.0.50727;C​:\s\VC\VCPackages;c​:\cygwin\bin;c​:\cygwin\usr\X11R6\bin;c​:\cygwin\usr\local\bin;c​:\bin;c​:\stlport\STLport-5.1.5\bin;c​:\icu\icu-3.4.1\bin;;C​:\graphviz\Graphviz\bin;c​:\doxygen\bin;C​:\quicktime\QTSystem\
  PERL_BADLANG (unset)
  SHELL=c​:\cygwin\bin\zsh.exe

@p5pRT

p5pRT commented Jun 29, 2008

From @andk

On Sun, 29 Jun 2008 00:38:27 -0700, "Chris Pirazzi" (via RT) <perlbug-followup@perl.org> said:

  > use utf8;
  > use strict;
  > use English qw( -no_match_vars );
  > use charnames ':full';
  > my $r1 = qr/\N{THAI CHARACTER SARA I}/;
  > my $s1 = "foo";
  > $s1 =~ /$r1+/;

  > The problem is that the last line errs with:

  > Constant(\N{THAI CHARACTER SARA I}) unknown: (possibly a missing "use
  > charnames...") in regex; marked by <-- HERE in m/(?-xism:\N{THAI
  > CHARACTER SARA I} <-- HERE )+/ at t.pl line 7.

Thanks for the report. The patch that broke this script was 28868:

Change 28868 by merijn@merijn-lt09 on 2006/09/19 06:56:36

  Subject: Re: \N{...} in regular expression [PATCH]
  From: demerphq <demerphq@gmail.com>
  Date: Tue, 19 Sep 2006 01:37:19 +0200
  Message-ID: <9b18b3110609181637m796d6c16o1b2741edc5f09eb2@mail.gmail.com>

See also http://rt.cpan.org/Ticket/Display.html?id=34388

HTH,
--
andreas

@p5pRT

p5pRT commented Jun 29, 2008

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

p5pRT commented Jul 7, 2008

From @rgs

2008/6/29 via RT Chris Pirazzi <perlbug-followup@perl.org>:

> Perl 5.10 (ActiveState ActivePerl Build 1003) breaks the following
> script, as compared with Perl 5.8.8 (ActiveState ActivePerl Build
> 822):
>
> use utf8;
> use strict;
> use English qw( -no_match_vars );
> use charnames ':full';
> my $r1 = qr/\N{THAI CHARACTER SARA I}/;
> my $s1 = "foo";
> $s1 =~ /$r1+/;
>
> The problem is that the last line errs with:
>
> Constant(\N{THAI CHARACTER SARA I}) unknown: (possibly a missing "use
> charnames...") in regex; marked by <-- HERE in m/(?-xism:\N{THAI
> CHARACTER SARA I} <-- HERE )+/ at t.pl line 7.
>
> Note that I did use 'charnames' and that the \N{} DOES work in the first regex.
> The err is in line 7, the last line, where the correctly compiled
> regex gets re-interpolated.

Interestingly, if we wrap the code from "my $r1" to the end in an
eval(""), then it compiles correctly. So that's some kind of
time-of-loading problem.

@p5pRT

p5pRT commented Jul 8, 2008

From rick@bort.ca

On Jul 07 2008, Rafael Garcia-Suarez wrote:

> 2008/6/29 via RT Chris Pirazzi <perlbug-followup@perl.org>:
>
>> use utf8;
>> use strict;
>> use English qw( -no_match_vars );
>> use charnames ':full';
>> my $r1 = qr/\N{THAI CHARACTER SARA I}/;
>> my $s1 = "foo";
>> $s1 =~ /$r1+/;
>
> [...]
>
> Interestingly, if we wrap the code from "my $r1" to the end in an
> eval(""), then it compiles correctly. So that's some kind of
> time-of-loading problem.

I think that may just be because "\N{THAI CHARACTER SARA I}" is
interpolated before qr// gets it. It looks like the stringification of
qr// references has changed as a side effect of the structure changes.

  use charnames ':full';
  use Devel::Peek;
  $x = qr/\N{THAI CHARACTER SARA I}/;
  print $x;
  Dump $x

5.8.8 output

(?-xism:ิ)
SV = RV(0x819cff4) at 0x81495d0
  REFCNT = 1
  FLAGS = (ROK,UTF8)
  RV = 0x8148d54
  SV = PVMG(0x8163050) at 0x8148d54
  REFCNT = 1
  FLAGS = (OBJECT,SMG)
  IV = 0
  NV = 0
  PV = 0
  MAGIC = 0x816b410
  MG_VIRTUAL = 0x8144608
  MG_TYPE = PERL_MAGIC_qr(r)
  MG_OBJ = 0x816b198
  MG_LEN = 12
  MG_PTR = 0x8163c10 "(?-xism:\340\270\264)"
  STASH = 0x81490f0 "Regexp"

blead output

(?-xism:\N{THAI CHARACTER SARA I})
SV = IV(0x83ac0fc) at 0x83ac100
  REFCNT = 1
  FLAGS = (ROK,UTF8)
  RV = 0x83ac0e0
  SV = REGEXP(0x83b65f0) at 0x83ac0e0
  REFCNT = 2
  FLAGS = (OBJECT,POK,pPOK,UTF8)
  IV = 0
  PV = 0x83b06b0 "(?-xism:\\N{THAI CHARACTER SARA I})"\0 [UTF8 "(?-xism:\N{THAI CHARACTER SARA I})"]
  CUR = 34
  LEN = 36
  STASH = 0x8397940 "Regexp"

--
Rick Delaney
rick@​bort.ca

@p5pRT

p5pRT commented Jul 10, 2008

From @demerphq

2008/7/8 Rick Delaney <rick@bort.ca>:

> On Jul 07 2008, Rafael Garcia-Suarez wrote:
>
>> 2008/6/29 via RT Chris Pirazzi <perlbug-followup@perl.org>:
>>
>>> use utf8;
>>> use strict;
>>> use English qw( -no_match_vars );
>>> use charnames ':full';
>>> my $r1 = qr/\N{THAI CHARACTER SARA I}/;
>>> my $s1 = "foo";
>>> $s1 =~ /$r1+/;
>>
>> [...]
>>
>> Interestingly, if we wrap the code from "my $r1" to the end in an
>> eval(""), then it compiles correctly. So that's some kind of
>> time-of-loading problem.
>
> I think that may just be because "\N{THAI CHARACTER SARA I}" is
> interpolated before qr// gets it. It looks like the stringification of
> qr// references has changed as a side effect of the structure changes.

Yes this was the intent of the change, to prevent the conversion of
named escapes to strings before the regex engine saw them, otherwise
characters specified by the \N{} notation could/would be treated as
regex metachars, with not so cool consequences.
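
To illustrate the metacharacter concern, here is a small sketch (not taken from the patch itself) that contrasts a name resolved to text before interpolation with one written inside the pattern. It assumes a perl where \N{...} inside a pattern is handled by the regex engine, as the change intended.

```perl
use strict;
use warnings;
use charnames ':full';

my $plus = "\N{PLUS SIGN}";   # resolved to "+" while building the string

# Interpolated as plain text, the "+" is read as a quantifier: this is /a+b/.
my $as_quantifier = ('aabc' =~ /a${plus}b/) ? 1 : 0;

# Written inside the pattern, \N{PLUS SIGN} is meant to stay a literal "+".
my $literal_hit  = ('a+bc' =~ /a\N{PLUS SIGN}b/) ? 1 : 0;
my $literal_miss = ('aabc' =~ /a\N{PLUS SIGN}b/) ? 1 : 0;

# Quoting the interpolated value with \Q...\E also keeps it literal.
my $quoted_hit = ('a+bc' =~ /a\Q$plus\E/) ? 1 : 0;

print "$as_quantifier $literal_hit $literal_miss $quoted_hit\n";
```

The first match succeeds only because the resolved character was reinterpreted as regex syntax, which is the behaviour the change set out to prevent.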

I really don't understand why it's not working in this case. The second
pattern is in the same scope as the first, so this doesn't make a lot
of sense to me. I guess someone needs to look at it in the debugger and
see why it thinks the charnames declaration isn't in scope.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

p5pRT commented Jan 8, 2009

From elliot@foobiebletch.com

Created by perl@galumph.com

Putting a variable expansion into a regex with a \N{} escape
results in a compilation error in 5.10.0. For example, the
following program compiles:

  #!/usr/bin/env perl
  use charnames ':full';
  m/\N{START OF HEADING}/

However, this

  #!/usr/bin/env perl
  use charnames ':full';
  m/$x\N{START OF HEADING}/

results in

  Constant(\N{START OF HEADING}) unknown: (possibly a missing
  "use charnames ...") in regex;

Perl Info

Flags:
    category=library
    severity=medium

Site configuration information for perl 5.10.0:

Configured by elliot at Sun Sep 14 15:07:20 CDT 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=darwin, osvers=9.4.0, archname=darwin-thread-multi-64int-ld-2level
    uname='darwin quaquaversal.local 9.4.0 darwin kernel version 9.4.0: mon jun 9 19:30:53 pdt 2008; root:xnu-1228.5.20~1release_i386 i386 '
    config_args='-Duse64bitint -Dusethreads -Dinc_version_list=none -Dprefix=/Users/elliot/opt/perl/perl-5.10.0-64bit-threads'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=undef, uselongdouble=define
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -I/usr/local/include -I/opt/local/include',
    optimize='-O3',
    cppflags='-no-cpp-precomp -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -I/usr/local/include -I/opt/local/include'
    ccversion='', gccversion='4.0.1 (Apple Inc. build 5465)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long long', ivsize=8, nvtype='long double', nvsize=16, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='env MACOSX_DEPLOYMENT_TARGET=10.5 cc', ldflags =' -L/usr/local/lib -L/opt/local/lib'
    libpth=/usr/local/lib /opt/local/lib /usr/lib
    libs=-ldbm -ldl -lm -lutil -lc
    perllibs=-ldl -lm -lutil -lc
    libc=/usr/lib/libc.dylib, so=dylib, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup -L/usr/local/lib -L/opt/local/lib'

Locally applied patches:
    


@INC for perl 5.10.0:
    /Users/elliot/opt/perl/perl-5.10.0-64bit-threads/lib/5.10.0/darwin-thread-multi-64int-ld-2level
    /Users/elliot/opt/perl/perl-5.10.0-64bit-threads/lib/5.10.0
    /Users/elliot/opt/perl/perl-5.10.0-64bit-threads/lib/site_perl/5.10.0/darwin-thread-multi-64int-ld-2level
    /Users/elliot/opt/perl/perl-5.10.0-64bit-threads/lib/site_perl/5.10.0
    .


Environment for perl 5.10.0:
    DYLD_LIBRARY_PATH (unset)
    HOME=/Users/elliot
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/Users/elliot/bin:/Users/elliot/opt/bin:/Users/elliot/opt/perl/perl-5.10.0-64bit-threads/bin:/opt/local/bin:/usr/local/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/X11R6/bin:/usr/X11/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT

p5pRT commented Jan 8, 2009

From @moritz

Elliot Shank wrote:

> # New Ticket Created by Elliot Shank
> # Please include the string: [perl #62056]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=62056 >
>
> This is a bug report for perl from perl@galumph.com,
> generated with the help of perlbug 1.36 running under perl 5.10.0.
>
> -----------------------------------------------------------------
> [Please enter your report here]
>
> Putting a variable expansion into a regex with a \N{} escape
> results in a compilation error in 5.10.0. For example, the
> following program compiles:
>
>   #!/usr/bin/env perl
>   use charnames ':full';
>   m/\N{START OF HEADING}/
>
> However, this
>
>   #!/usr/bin/env perl
>   use charnames ':full';
>   m/$x\N{START OF HEADING}/
>
> results in
>
>   Constant(\N{START OF HEADING}) unknown: (possibly a missing
>   "use charnames ...") in regex;

This worked in perl-5.8.8, and fails in perl-5.10.0.
So I bisected it, and this is what git-bisect says is the offending commit:

fc8cd66 is first bad commit
commit fc8cd66
Author: Yves Orton <demerphq@gmail.com>
Date: Tue Sep 19 03:37:19 2006 +0200

  Re: \N{...} in regular expression [PATCH]
  Message-ID: <9b18b3110609181637m796d6c16o1b2741edc5f09eb2@mail.gmail.com>

  p4raw-id: //depot/perl@28868

Cheers,
Moritz

@p5pRT

p5pRT commented Jan 8, 2009

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

p5pRT commented May 28, 2009

From @nwc10

Dave notes:

regression since 5.8.x
30/12/08 Yves says it's tricky to fix

@p5pRT

p5pRT commented Jul 11, 2009

From @schwern

Bizarrely, it works in an eval.

$ perl5.10.0 -wle 'use charnames ":full"; my $x = ""; print "\N{LATIN
CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/; print $@'
Constant(\N{LATIN CAPITAL LETTER E}) unknown: (possibly a missing
"use charnames ...") in regex; marked by <-- HERE in m/\N{LATIN CAPITAL
LETTER E} <-- HERE / at -e line 1.

$ perl5.10.0 -wle 'use charnames ":full"; my $x = ""; print eval
q{"\N{LATIN CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/};
print $@'
1

@p5pRT

p5pRT commented Jul 11, 2009

From @schwern

Looking at the code that would generate that specific error, there are
three places it could happen. It turns out it's the one in regcomp.c.

  if (!table || !(PL_hints & HINT_LOCALIZE_HH)) {
      vFAIL2("Constant(\\N{%s}) unknown: "
             "regcomp.c (possibly a missing \"use charnames ...\")",
             SvPVX(sv_name));
  }

Digging further, it's the second clause that trips, because PL_hints has
the wrong value. For /$x\N{...}/ PL_hints is 2**8, and for
/\N{...}/ it's 131328, which is 2**8 + HINT_LOCALIZE_HH.

And that's about as far as I can go.
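
The arithmetic above can be checked with a short sketch. The two constant values are as I recall them from perl.h and should be treated as assumptions. The value of HINT_LOCALIZE_HH at least follows from the numbers quoted (131328 - 2**8).

```perl
use strict;
use warnings;

# Assumed values, believed to match perl.h.
use constant HINT_BLOCK_SCOPE => 0x00000100;   # 2**8
use constant HINT_LOCALIZE_HH => 0x00020000;   # 131072

# PL_hints values reported in the comment above.
my %observed = (
    '/$x\N{...}/' => 2**8,
    '/\N{...}/'   => 131328,
);

for my $pattern (sort keys %observed) {
    my $hints  = $observed{$pattern};
    my $has_hh = ($hints & HINT_LOCALIZE_HH) ? 'set' : 'clear';
    printf "%-12s PL_hints=0x%05X HINT_LOCALIZE_HH %s\n",
        $pattern, $hints, $has_hh;
}
```

With the bit clear, !(PL_hints & HINT_LOCALIZE_HH) is true and the vFAIL2 branch is taken, even if the charnames table itself could be found.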

@p5pRT

p5pRT commented Jul 11, 2009

From @schwern

A test for this is available from git://github.com/schwern/perl.git in
branch rt.cpan.org-62056. Also supplied here as a patch.

@p5pRT

p5pRT commented Jul 11, 2009

From @schwern

0001-Test-rt.cpan.org-62056.patch
From 7bed143fcc74b8bed3d7ed13de2ef000e5523b9e Mon Sep 17 00:00:00 2001
From: Michael G. Schwern <schwern@pobox.com>
Date: Sat, 11 Jul 2009 01:49:19 -0700
Subject: [PATCH] Test rt.cpan.org 62056

---
 t/op/pat.t |   24 +++++++++++++++++++++++-
 1 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/t/op/pat.t b/t/op/pat.t
index aa6299f..039ac50 100644
--- a/t/op/pat.t
+++ b/t/op/pat.t
@@ -13,7 +13,7 @@ sub run_tests;
 
 $| = 1;
 
-my $EXPECTED_TESTS = 4065;  # Update this when adding/deleting tests.
+my $EXPECTED_TESTS = 4066;  # Update this when adding/deleting tests.
 
 BEGIN {
     chdir 't' if -d 't';
@@ -1792,6 +1792,28 @@ sub run_tests {
     }
 
 
+    # rt.cpan.org 62056
+    # Problem with a variable before a \N{...} in a pattern match
+    {
+        package RT::62056;
+
+        # We need fresh_perl_is()
+        require './test.pl';
+
+        # Synch its test count with the one in pat.t
+        $test++;
+        curr_test($test);
+
+        local $RT::62056::TODO = "rt.cpan.org 62056";
+
+        fresh_perl_is(<<'CODE', "Yes", {}, 'variable before \N{...}');
+use charnames ":full";
+$x = "";
+print "Yes" if "$x\N{LATIN CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/;
+CODE
+    }
+
+
     {
         local $Message = "Final Sigma";
 
-- 
1.6.2.4

@p5pRT

p5pRT commented Jul 11, 2009

From @schwern

This appears to be a duplicate of 62056

@p5pRT

p5pRT commented Jul 11, 2009

From @rgs

2009/7/11 Michael G Schwern via RT <perlbug-followup@perl.org>:

> A test for this is available from git://github.com/schwern/perl.git in
> branch rt.cpan.org-62056.  Also supplied here as a patch.

I don't like this patch. You require() test.pl in a new package
RT::62056. OK, you don't want to stomp on the ok() subroutine already
defined here. But then you need to do an awkward setting of
$RT::62056::TODO, and if you want to require test.pl in another test
in the same file, it will fail, because we already have test.pl in
%INC.

I'd probably go for the least effort and put that in a new file.
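
For readers unfamiliar with the %INC point, here is a minimal sketch of the general mechanism, using a throwaway file rather than the real t/test.pl: a second require of the same path is a no-op because the path is already recorded in %INC.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

our $loads = 0;

# Write a tiny stand-in library that counts how often it is executed.
my ($fh, $path) = tempfile(SUFFIX => '.pl', UNLINK => 1);
print {$fh} "\$main::loads++;\n1;\n";
close $fh or die "close failed: $!";

require $path;    # first call runs the file and records it in %INC
require $path;    # second call finds the %INC entry and does nothing

my $recorded = exists $INC{$path} ? "yes" : "no";
print "executed $loads time(s); recorded in \%INC: $recorded\n";
```

Anything the file sets up is therefore done only once, which is why a second block relying on a fresh require of the same file would not get what it expects.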

@p5pRT

p5pRT commented Jul 11, 2009

From @khwilliamson

Rafael Garcia-Suarez wrote:

> 2009/7/11 Michael G Schwern via RT <perlbug-followup@perl.org>:
>
>> A test for this is available from git://github.com/schwern/perl.git in
>> branch rt.cpan.org-62056. Also supplied here as a patch.
>
> I don't like this patch. You require() test.pl in a new package
> RT::62056. Ok, you don't want to stomp on the ok() subroutine already
> defined here. But then you need to do an awkward setting of
> $RT::62056::TODO, and if you want to require test.pl in another test
> in the same file, it will fail, because we have already test.pl in
> %INC.
>
> I'd probably go for the less effort and put that in a new file.

Here are 3 more lines for the patch, if you like, that reproduce the
similar #56444:

my $r1 = qr/\N{THAI CHARACTER SARA I}/; #56444
my $s1 = "foo";
$s1 =~ /$r1+/;

FWIW, I wrote about this a couple of weeks ago and concluded that both
bugs probably share the same root cause. Forcing the code to ignore the
HINT_LOCALIZE_HH problem made 62056 stop failing, while 56444 failed
later with a message suggesting it would succeed if the root cause
were fixed.

@p5pRT

p5pRT commented Jul 12, 2009

From @schwern

Rafael Garcia-Suarez wrote:

> 2009/7/11 Michael G Schwern via RT <perlbug-followup@perl.org>:
>
>> A test for this is available from git://github.com/schwern/perl.git in
>> branch rt.cpan.org-62056. Also supplied here as a patch.
>
> I don't like this patch. You require() test.pl in a new package
> RT::62056. Ok, you don't want to stomp on the ok() subroutine already
> defined here. But then you need to do an awkward setting of
> $RT::62056::TODO, and if you want to require test.pl in another test
> in the same file, it will fail, because we have already test.pl in
> %INC.
>
> I'd probably go for the less effort and put that in a new file.

I didn't like it either, but I didn't want to clean up the whole file or lump
it into fresh_perl.t.

The real problem is that op/pat.t is far too big. I think I'll start by splitting
all the charnames tests out. As good a start as any.

git://github.com/schwern/perl.git branch rt.cpan.org-62056 has the changes.
On the way I also added note() to test.pl to replace the $Message system used
in pat.t. Patches attached.

Karl Williamson wrote:

> Here's 3 more lines for the patch, if you like, that reproduce the similar #56444
>
> my $r1 = qr/\N{THAI CHARACTER SARA I}/; #56444
> my $s1 = "foo";
> $s1 =~ /$r1+/;
>
> FWIW, I had written about this a couple of weeks ago, and concluded that
> probably both bugs were from the same root, and by forcing the code to
> ignore the problem with HINT_LOCALIZE_HH caused 62056 to not fail,
> and 56444 failed later with a message that looked like if the root
> were fixed it would succeed.

Yeah, I came to that conclusion too.

A possibly related note:

$ perl5.10.0 -wl
use charnames ":full";
print "Yes" if "\N{THAI CHARACTER SARA I}" =~ /\N{THAI CHARACTER SARA I}+/;
\N{THAI CHARACTER SARA I}+ matches null string many times in regex; marked by
<-- HERE in m/\N{THAI CHARACTER SARA I}+ <-- HERE / at - line 2.
Yes

$ perl5.10.1 -wl
use charnames ":full";
print "Yes" if "\N{THAI CHARACTER SARA I}" =~ /\N{THAI CHARACTER SARA I}+/;
\N{THAI CHARACTER SARA I}+ matches null string many times in regex; marked by
<-- HERE in m/\N{THAI CHARACTER SARA I}+ <-- HERE / at - line 2.
Yes

I don't know if that's a bug or a feature, but it is a test now.

--
60. "The Giant Space Ants" are not at the top of my chain of command.
  -- The 213 Things Skippy Is No Longer Allowed To Do In The U.S. Army
  http​://skippyslist.com/list/

@p5pRT

p5pRT commented Jul 12, 2009

From @schwern

0001-Test-rt.cpan.org-62056.patch
From 7bed143fcc74b8bed3d7ed13de2ef000e5523b9e Mon Sep 17 00:00:00 2001
From: Michael G. Schwern <schwern@pobox.com>
Date: Sat, 11 Jul 2009 01:49:19 -0700
Subject: [PATCH 1/4] Test rt.cpan.org 62056

---
 t/op/pat.t |   24 +++++++++++++++++++++++-
 1 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/t/op/pat.t b/t/op/pat.t
index aa6299f..039ac50 100644
--- a/t/op/pat.t
+++ b/t/op/pat.t
@@ -13,7 +13,7 @@ sub run_tests;
 
 $| = 1;
 
-my $EXPECTED_TESTS = 4065;  # Update this when adding/deleting tests.
+my $EXPECTED_TESTS = 4066;  # Update this when adding/deleting tests.
 
 BEGIN {
     chdir 't' if -d 't';
@@ -1792,6 +1792,28 @@ sub run_tests {
     }
 
 
+    # rt.cpan.org 62056
+    # Problem with a variable before a \N{...} in a pattern match
+    {
+        package RT::62056;
+
+        # We need fresh_perl_is()
+        require './test.pl';
+
+        # Synch its test count with the one in pat.t
+        $test++;
+        curr_test($test);
+
+        local $RT::62056::TODO = "rt.cpan.org 62056";
+
+        fresh_perl_is(<<'CODE', "Yes", {}, 'variable before \N{...}');
+use charnames ":full";
+$x = "";
+print "Yes" if "$x\N{LATIN CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/;
+CODE
+    }
+
+
     {
         local $Message = "Final Sigma";
 
-- 
1.6.2.4

@p5pRT

p5pRT commented Jul 12, 2009

From @schwern

0002-Add-note-to-test.pl-like-Test-More-s.patch
From fcbd249a3f121bc7465f6b95be3ba885d75f14da Mon Sep 17 00:00:00 2001
From: Michael G. Schwern <schwern@pobox.com>
Date: Sat, 11 Jul 2009 16:25:46 -0700
Subject: [PATCH 2/4] Add note() to test.pl like Test::More's

---
 t/test.pl |   14 +++++++++++---
 1 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/t/test.pl b/t/test.pl
index 32c4a37..332fc60 100644
--- a/t/test.pl
+++ b/t/test.pl
@@ -67,16 +67,24 @@ END {
 # Use this instead of "print STDERR" when outputing failure diagnostic
 # messages
 sub _diag {
-    return unless @_;
-    my @mess = map { /^#/ ? "$_\n" : "# $_\n" }
-               map { split /\n/ } @_;
+    my @mess = _comment(@_);
     $TODO ? _print(@mess) : _print_stderr(@mess);
 }
 
+sub _comment {
+    return unless @_;
+    return map { /^#/ ? "$_\n" : "# $_\n" }
+           map { split /\n/ } @_;
+}
+
 sub diag {
     _diag(@_);
 }
 
+sub note {
+    _print _comment(@_);
+}
+
 sub skip_all {
     if (@_) {
         _print "1..0 # Skip @_\n";
-- 
1.6.2.4
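
As a rough standalone illustration of what the new _comment() helper in this patch does, the sketch below copies its body under a different name (comment_lines is made up here) and runs it on sample input. The _print and _print_stderr plumbing from test.pl is left out.

```perl
use strict;
use warnings;

# Same body as _comment() in the patch, renamed for standalone use.
sub comment_lines {
    return unless @_;
    return map { /^#/ ? "$_\n" : "# $_\n" }
           map { split /\n/ } @_;
}

# Multi-line messages are split, and lines already starting with "#"
# are not prefixed a second time.
my @out = comment_lines("first line\nsecond line", "# already a comment");
print @out;
```

note() prints such lines to standard output, while _diag() sends them to standard error unless $TODO is set.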

@p5pRT

p5pRT commented Jul 12, 2009

From @schwern

0003-Chop-out-the-tests-from-op-pat.t-which-involve-using.patch
From 18fe84f50714f3413e090b9a7111dd7edb3c6351 Mon Sep 17 00:00:00 2001
From: Michael G. Schwern <schwern@pobox.com>
Date: Sat, 11 Jul 2009 16:25:59 -0700
Subject: [PATCH 3/4] Chop out the tests from op/pat.t which involve using charnames and put them into their own test file.

The $Message system is replaced by note().  Most of the special case work to identify which test is which is unnecessary in a shorter test file.

may_not_warn() should probably be pushed into test.pl
---
 t/op/pat.t              |  195 +-------------------------------------------
 t/op/regexp_charnames.t |  208 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 209 insertions(+), 194 deletions(-)
 create mode 100644 t/op/regexp_charnames.t

diff --git a/t/op/pat.t b/t/op/pat.t
index 039ac50..192a972 100644
--- a/t/op/pat.t
+++ b/t/op/pat.t
@@ -13,7 +13,7 @@ sub run_tests;
 
 $| = 1;
 
-my $EXPECTED_TESTS = 4066;  # Update this when adding/deleting tests.
+my $EXPECTED_TESTS = 4002;  # Update this when adding/deleting tests.
 
 BEGIN {
     chdir 't' if -d 't';
@@ -1726,95 +1726,6 @@ sub run_tests {
 
 
     {
-        use charnames ':full';
-        local $Message = "Folding 'LATIN LETTER A WITH GRAVE'";
-
-        my $lower = "\N{LATIN SMALL LETTER A WITH GRAVE}";
-        my $UPPER = "\N{LATIN CAPITAL LETTER A WITH GRAVE}";
-        
-        ok $lower =~ m/$UPPER/i;
-        ok $UPPER =~ m/$lower/i;
-        ok $lower =~ m/[$UPPER]/i;
-        ok $UPPER =~ m/[$lower]/i;
-
-        local $Message = "Folding 'GREEK LETTER ALPHA WITH VRACHY'";
-
-        $lower = "\N{GREEK CAPITAL LETTER ALPHA WITH VRACHY}";
-        $UPPER = "\N{GREEK SMALL LETTER ALPHA WITH VRACHY}";
-
-        ok $lower =~ m/$UPPER/i;
-        ok $UPPER =~ m/$lower/i;
-        ok $lower =~ m/[$UPPER]/i;
-        ok $UPPER =~ m/[$lower]/i;
-
-        local $Message = "Folding 'LATIN LETTER Y WITH DIAERESIS'";
-
-        $lower = "\N{LATIN SMALL LETTER Y WITH DIAERESIS}";
-        $UPPER = "\N{LATIN CAPITAL LETTER Y WITH DIAERESIS}";
-
-        ok $lower =~ m/$UPPER/i;
-        ok $UPPER =~ m/$lower/i;
-        ok $lower =~ m/[$UPPER]/i;
-        ok $UPPER =~ m/[$lower]/i;
-    }
-
-
-    {
-        use charnames ':full';
-        local $PatchId = "13843";
-        local $Message = "GREEK CAPITAL LETTER SIGMA vs " .
-                         "COMBINING GREEK PERISPOMENI";
-
-        my $SIGMA = "\N{GREEK CAPITAL LETTER SIGMA}";
-        my $char  = "\N{COMBINING GREEK PERISPOMENI}";
-
-        may_not_warn sub {ok "_:$char:_" !~ m/_:$SIGMA:_/i};
-    }
-
-
-    {
-        local $Message = '\X';
-        use charnames ':full';
-
-        ok "a!"                          =~ /^(\X)!/ && $1 eq "a";
-        ok "\xDF!"                       =~ /^(\X)!/ && $1 eq "\xDF";
-        ok "\x{100}!"                    =~ /^(\X)!/ && $1 eq "\x{100}";
-        ok "\x{100}\x{300}!"             =~ /^(\X)!/ && $1 eq "\x{100}\x{300}";
-        ok "\N{LATIN CAPITAL LETTER E}!" =~ /^(\X)!/ &&
-               $1 eq "\N{LATIN CAPITAL LETTER E}";
-        ok "\N{LATIN CAPITAL LETTER E}\N{COMBINING GRAVE ACCENT}!"
-                                         =~ /^(\X)!/ &&
-               $1 eq "\N{LATIN CAPITAL LETTER E}\N{COMBINING GRAVE ACCENT}";
-
-        local $Message = '\C and \X';
-        ok "!abc!" =~ /a\Cc/;
-        ok "!abc!" =~ /a\Xc/;
-    }
-
-
-    # rt.cpan.org 62056
-    # Problem with a variable before a \N{...} in a pattern match
-    {
-        package RT::62056;
-
-        # We need fresh_perl_is()
-        require './test.pl';
-
-        # Synch its test count with the one in pat.t
-        $test++;
-        curr_test($test);
-
-        local $RT::62056::TODO = "rt.cpan.org 62056";
-
-        fresh_perl_is(<<'CODE', "Yes", {}, 'variable before \N{...}');
-use charnames ":full";
-$x = "";
-print "Yes" if "$x\N{LATIN CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/;
-CODE
-    }
-
-
-    {
         local $Message = "Final Sigma";
 
         my $SIGMA = "\x{03A3}"; # CAPITAL
@@ -1860,46 +1771,6 @@ CODE
 
 
     {
-        use charnames ':full';
-        local $Message = "Parlez-Vous " .
-                         "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais?";
-
-        ok "Fran\N{LATIN SMALL LETTER C}ais" =~ /Fran.ais/ &&
-            $& eq "Francais";
-        ok "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais" =~ /Fran.ais/ &&
-            $& eq "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais";
-        ok "Fran\N{LATIN SMALL LETTER C}ais" =~ /Fran\Cais/ &&
-            $& eq "Francais";
-        # COMBINING CEDILLA is two bytes when encoded
-        ok "Franc\N{COMBINING CEDILLA}ais" =~ /Franc\C\Cais/;
-        ok "Fran\N{LATIN SMALL LETTER C}ais" =~ /Fran\Xais/ &&
-            $& eq "Francais";
-        ok "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais" =~ /Fran\Xais/  &&
-            $& eq "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais";
-        ok "Franc\N{COMBINING CEDILLA}ais" =~ /Fran\Xais/ &&
-            $& eq "Franc\N{COMBINING CEDILLA}ais";
-        ok "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais" =~
-           /Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais/  &&
-            $& eq "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais";
-        ok "Franc\N{COMBINING CEDILLA}ais" =~ /Franc\N{COMBINING CEDILLA}ais/ &&
-            $& eq "Franc\N{COMBINING CEDILLA}ais";
-
-        my @f = (
-            ["Fran\N{LATIN SMALL LETTER C}ais",                    "Francais"],
-            ["Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais",
-                               "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais"],
-            ["Franc\N{COMBINING CEDILLA}ais", "Franc\N{COMBINING CEDILLA}ais"],
-        );
-        foreach my $entry (@f) {
-            my ($subject, $match) = @$entry;
-            ok $subject =~ /Fran(?:c\N{COMBINING CEDILLA}?|
-                    \N{LATIN SMALL LETTER C WITH CEDILLA})ais/x &&
-               $& eq $match;
-        }
-    }
-
-
-    {
         local $Message = "Lingering (and useless) UTF8 flag doesn't mess up /i";
         my $pat = "ABcde";
         my $str = "abcDE\x{100}";
@@ -1920,38 +1791,6 @@ CODE
 
 
     {
-        use charnames ':full';
-        local $Message = "LATIN SMALL LETTER SHARP S " .
-                         "(\N{LATIN SMALL LETTER SHARP S})";
-
-        ok "\N{LATIN SMALL LETTER SHARP S}" =~
-                                            /\N{LATIN SMALL LETTER SHARP S}/;
-        ok "\N{LATIN SMALL LETTER SHARP S}" =~
-                                            /\N{LATIN SMALL LETTER SHARP S}/i;
-        ok "\N{LATIN SMALL LETTER SHARP S}" =~
-                                           /[\N{LATIN SMALL LETTER SHARP S}]/;
-        ok "\N{LATIN SMALL LETTER SHARP S}" =~
-                                           /[\N{LATIN SMALL LETTER SHARP S}]/i;
-
-        ok "ss" =~  /\N{LATIN SMALL LETTER SHARP S}/i;
-        ok "SS" =~  /\N{LATIN SMALL LETTER SHARP S}/i;
-        ok "ss" =~ /[\N{LATIN SMALL LETTER SHARP S}]/i;
-        ok "SS" =~ /[\N{LATIN SMALL LETTER SHARP S}]/i;
-
-        ok "\N{LATIN SMALL LETTER SHARP S}" =~ /ss/i;
-        ok "\N{LATIN SMALL LETTER SHARP S}" =~ /SS/i;
- 
-        local $Message = "Unoptimized named sequence in class";
-        ok "ss" =~ /[\N{LATIN SMALL LETTER SHARP S}x]/i;
-        ok "SS" =~ /[\N{LATIN SMALL LETTER SHARP S}x]/i;
-        ok "\N{LATIN SMALL LETTER SHARP S}" =~
-          /[\N{LATIN SMALL LETTER SHARP S}x]/;
-        ok "\N{LATIN SMALL LETTER SHARP S}" =~
-          /[\N{LATIN SMALL LETTER SHARP S}x]/i;
-    }
-
-
-    {
         # More whitespace: U+0085, U+2028, U+2029\n";
 
         # U+0085, U+00A0 need to be forced to be Unicode, the \x{100} does that.
@@ -2930,23 +2769,6 @@ CODE
 
 
     {
-        use charnames ':full';
-
-        ok 'aabc' !~ /a\N{PLUS SIGN}b/, '/a\N{PLUS SIGN}b/ against aabc';
-        ok 'a+bc' =~ /a\N{PLUS SIGN}b/, '/a\N{PLUS SIGN}b/ against a+bc';
-
-        ok ' A B' =~ /\N{SPACE}\N{U+0041}\N{SPACE}\N{U+0042}/,
-            'Intermixed named and unicode escapes';
-        ok "\N{SPACE}\N{U+0041}\N{SPACE}\N{U+0042}" =~
-           /\N{SPACE}\N{U+0041}\N{SPACE}\N{U+0042}/,
-            'Intermixed named and unicode escapes';
-        ok "\N{SPACE}\N{U+0041}\N{SPACE}\N{U+0042}" =~
-           /[\N{SPACE}\N{U+0041}][\N{SPACE}\N{U+0042}]/,
-            'Intermixed named and unicode escapes';     
-    }
-
-
-    {
         our $brackets;
         $brackets = qr{
             {  (?> [^{}]+ | (??{ $brackets }) )* }
@@ -3655,21 +3477,6 @@ CODE
 
 
     {
-        use charnames ":full";
-        ok "\N{ROMAN NUMERAL ONE}" =~ /\p{Alphabetic}/, "I =~ Alphabetic";
-        ok "\N{ROMAN NUMERAL ONE}" =~ /\p{Uppercase}/,  "I =~ Uppercase";
-        ok "\N{ROMAN NUMERAL ONE}" !~ /\p{Lowercase}/,  "I !~ Lowercase";
-        ok "\N{ROMAN NUMERAL ONE}" =~ /\p{IDStart}/,    "I =~ ID_Start";
-        ok "\N{ROMAN NUMERAL ONE}" =~ /\p{IDContinue}/, "I =~ ID_Continue";
-        ok "\N{SMALL ROMAN NUMERAL ONE}" =~ /\p{Alphabetic}/, "i =~ Alphabetic";
-        ok "\N{SMALL ROMAN NUMERAL ONE}" !~ /\p{Uppercase}/,  "i !~ Uppercase";
-        ok "\N{SMALL ROMAN NUMERAL ONE}" =~ /\p{Lowercase}/,  "i =~ Lowercase";
-        ok "\N{SMALL ROMAN NUMERAL ONE}" =~ /\p{IDStart}/,    "i =~ ID_Start";
-        ok "\N{SMALL ROMAN NUMERAL ONE}" =~ /\p{IDContinue}/, "i =~ ID_Continue"
-    }
-
-
-    {
         # requirement of Unicode Technical Standard #18, 1.7 Code Points
         # cf. http://www.unicode.org/reports/tr18/#Supplementary_Characters
         for my $u (0x7FF, 0x800, 0xFFFF, 0x10000) {
diff --git a/t/op/regexp_charnames.t b/t/op/regexp_charnames.t
new file mode 100644
index 0000000..e128671
--- /dev/null
+++ b/t/op/regexp_charnames.t
@@ -0,0 +1,208 @@
+#!./perl
+
+# This is a test of regexes problems which involve charnames.
+
+use strict;
+use warnings;
+use 5.010;
+
+BEGIN {
+    chdir 't' if -d 't';
+    @INC = '../lib';
+    require "./test.pl";
+}
+
+plan tests => 64;
+
+
+sub may_not_warn {
+    my ($code, $name) = @_;
+    my $w = '';
+    local $SIG {__WARN__} = sub {$w .= join "" => @_};
+    use warnings 'all';
+    ref $code ? &$code : eval $code;
+    is $w, "", $name // "Did not warn";
+}
+
+
+{
+    use charnames ':full';
+    note "Folding 'LATIN LETTER A WITH GRAVE'";
+
+    my $lower = "\N{LATIN SMALL LETTER A WITH GRAVE}";
+    my $UPPER = "\N{LATIN CAPITAL LETTER A WITH GRAVE}";
+        
+    ok $lower =~ m/$UPPER/i;
+    ok $UPPER =~ m/$lower/i;
+    ok $lower =~ m/[$UPPER]/i;
+    ok $UPPER =~ m/[$lower]/i;
+
+    note "Folding 'GREEK LETTER ALPHA WITH VRACHY'";
+
+    $lower = "\N{GREEK CAPITAL LETTER ALPHA WITH VRACHY}";
+    $UPPER = "\N{GREEK SMALL LETTER ALPHA WITH VRACHY}";
+
+    ok $lower =~ m/$UPPER/i;
+    ok $UPPER =~ m/$lower/i;
+    ok $lower =~ m/[$UPPER]/i;
+    ok $UPPER =~ m/[$lower]/i;
+
+    note "Folding 'LATIN LETTER Y WITH DIAERESIS'";
+
+    $lower = "\N{LATIN SMALL LETTER Y WITH DIAERESIS}";
+    $UPPER = "\N{LATIN CAPITAL LETTER Y WITH DIAERESIS}";
+
+    ok $lower =~ m/$UPPER/i;
+    ok $UPPER =~ m/$lower/i;
+    ok $lower =~ m/[$UPPER]/i;
+    ok $UPPER =~ m/[$lower]/i;
+}
+
+
+{
+    use charnames ':full';
+    note "Patch 13843";
+    note "GREEK CAPITAL LETTER SIGMA vs COMBINING GREEK PERISPOMENI";
+
+    my $SIGMA = "\N{GREEK CAPITAL LETTER SIGMA}";
+    my $char  = "\N{COMBINING GREEK PERISPOMENI}";
+
+    may_not_warn sub {ok "_:$char:_" !~ m/_:$SIGMA:_/i};
+}
+
+
+{
+    use charnames ':full';
+
+    note '\X';
+    ok "a!"                          =~ /^(\X)!/ && $1 eq "a";
+    ok "\xDF!"                       =~ /^(\X)!/ && $1 eq "\xDF";
+    ok "\x{100}!"                    =~ /^(\X)!/ && $1 eq "\x{100}";
+    ok "\x{100}\x{300}!"             =~ /^(\X)!/ && $1 eq "\x{100}\x{300}";
+    ok "\N{LATIN CAPITAL LETTER E}!" =~ /^(\X)!/ &&
+      $1 eq "\N{LATIN CAPITAL LETTER E}";
+    ok "\N{LATIN CAPITAL LETTER E}\N{COMBINING GRAVE ACCENT}!"
+      =~ /^(\X)!/ &&
+        $1 eq "\N{LATIN CAPITAL LETTER E}\N{COMBINING GRAVE ACCENT}";
+
+    note '\C and \X';
+    ok "!abc!" =~ /a\Cc/;
+    ok "!abc!" =~ /a\Xc/;
+}
+
+
+# rt.cpan.org 62056
+# Problem with a variable before a \N{...} in a pattern match
+{
+    local $::TODO = "rt.cpan.org 62056";
+
+    fresh_perl_is(<<'CODE', "Yes", {}, 'variable before \N{...}');
+use charnames ":full";
+$x = "";
+print "Yes" if "$x\N{LATIN CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/;
+CODE
+}
+
+
+{
+    use charnames ':full';
+    note "Parlez-Vous Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais?";
+
+    ok "Fran\N{LATIN SMALL LETTER C}ais" =~ /Fran.ais/ &&
+      $& eq "Francais";
+    ok "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais" =~ /Fran.ais/ &&
+      $& eq "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais";
+    ok "Fran\N{LATIN SMALL LETTER C}ais" =~ /Fran\Cais/ &&
+      $& eq "Francais";
+    # COMBINING CEDILLA is two bytes when encoded
+    ok "Franc\N{COMBINING CEDILLA}ais" =~ /Franc\C\Cais/;
+    ok "Fran\N{LATIN SMALL LETTER C}ais" =~ /Fran\Xais/ &&
+      $& eq "Francais";
+    ok "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais" =~ /Fran\Xais/  &&
+      $& eq "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais";
+    ok "Franc\N{COMBINING CEDILLA}ais" =~ /Fran\Xais/ &&
+      $& eq "Franc\N{COMBINING CEDILLA}ais";
+    ok "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais" =~
+      /Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais/  &&
+        $& eq "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais";
+    ok "Franc\N{COMBINING CEDILLA}ais" =~ /Franc\N{COMBINING CEDILLA}ais/ &&
+      $& eq "Franc\N{COMBINING CEDILLA}ais";
+
+    my @f = (
+        ["Fran\N{LATIN SMALL LETTER C}ais",                    "Francais"],
+        ["Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais",
+         "Fran\N{LATIN SMALL LETTER C WITH CEDILLA}ais"],
+        ["Franc\N{COMBINING CEDILLA}ais", "Franc\N{COMBINING CEDILLA}ais"],
+    );
+    foreach my $entry (@f) {
+        my ($subject, $match) = @$entry;
+        ok $subject =~ /Fran(?:c\N{COMBINING CEDILLA}?|
+                            \N{LATIN SMALL LETTER C WITH CEDILLA})ais/x &&
+                              $& eq $match;
+    }
+}
+
+
+{
+    use charnames ':full';
+    note "LATIN SMALL LETTER SHARP S (\N{LATIN SMALL LETTER SHARP S})";
+
+    ok "\N{LATIN SMALL LETTER SHARP S}" =~
+      /\N{LATIN SMALL LETTER SHARP S}/;
+    ok "\N{LATIN SMALL LETTER SHARP S}" =~
+      /\N{LATIN SMALL LETTER SHARP S}/i;
+    ok "\N{LATIN SMALL LETTER SHARP S}" =~
+      /[\N{LATIN SMALL LETTER SHARP S}]/;
+    ok "\N{LATIN SMALL LETTER SHARP S}" =~
+      /[\N{LATIN SMALL LETTER SHARP S}]/i;
+
+    ok "ss" =~  /\N{LATIN SMALL LETTER SHARP S}/i;
+    ok "SS" =~  /\N{LATIN SMALL LETTER SHARP S}/i;
+    ok "ss" =~ /[\N{LATIN SMALL LETTER SHARP S}]/i;
+    ok "SS" =~ /[\N{LATIN SMALL LETTER SHARP S}]/i;
+
+    ok "\N{LATIN SMALL LETTER SHARP S}" =~ /ss/i;
+    ok "\N{LATIN SMALL LETTER SHARP S}" =~ /SS/i;
+ 
+    note "Unoptimized named sequence in class";
+    ok "ss" =~ /[\N{LATIN SMALL LETTER SHARP S}x]/i;
+    ok "SS" =~ /[\N{LATIN SMALL LETTER SHARP S}x]/i;
+    ok "\N{LATIN SMALL LETTER SHARP S}" =~
+      /[\N{LATIN SMALL LETTER SHARP S}x]/;
+    ok "\N{LATIN SMALL LETTER SHARP S}" =~
+      /[\N{LATIN SMALL LETTER SHARP S}x]/i;
+}
+
+
+{
+    use charnames ':full';
+
+    ok 'aabc' !~ /a\N{PLUS SIGN}b/, '/a\N{PLUS SIGN}b/ against aabc';
+    ok 'a+bc' =~ /a\N{PLUS SIGN}b/, '/a\N{PLUS SIGN}b/ against a+bc';
+
+    ok ' A B' =~ /\N{SPACE}\N{U+0041}\N{SPACE}\N{U+0042}/,
+      'Intermixed named and unicode escapes';
+    ok "\N{SPACE}\N{U+0041}\N{SPACE}\N{U+0042}" =~
+      /\N{SPACE}\N{U+0041}\N{SPACE}\N{U+0042}/,
+        'Intermixed named and unicode escapes';
+    ok "\N{SPACE}\N{U+0041}\N{SPACE}\N{U+0042}" =~
+      /[\N{SPACE}\N{U+0041}][\N{SPACE}\N{U+0042}]/,
+        'Intermixed named and unicode escapes';     
+}
+
+
+{
+    use charnames ":full";
+    ok "\N{ROMAN NUMERAL ONE}" =~ /\p{Alphabetic}/, "I =~ Alphabetic";
+    ok "\N{ROMAN NUMERAL ONE}" =~ /\p{Uppercase}/,  "I =~ Uppercase";
+    ok "\N{ROMAN NUMERAL ONE}" !~ /\p{Lowercase}/,  "I !~ Lowercase";
+    ok "\N{ROMAN NUMERAL ONE}" =~ /\p{IDStart}/,    "I =~ ID_Start";
+    ok "\N{ROMAN NUMERAL ONE}" =~ /\p{IDContinue}/, "I =~ ID_Continue";
+    ok "\N{SMALL ROMAN NUMERAL ONE}" =~ /\p{Alphabetic}/, "i =~ Alphabetic";
+    ok "\N{SMALL ROMAN NUMERAL ONE}" !~ /\p{Uppercase}/,  "i !~ Uppercase";
+    ok "\N{SMALL ROMAN NUMERAL ONE}" =~ /\p{Lowercase}/,  "i =~ Lowercase";
+    ok "\N{SMALL ROMAN NUMERAL ONE}" =~ /\p{IDStart}/,    "i =~ ID_Start";
+    ok "\N{SMALL ROMAN NUMERAL ONE}" =~ /\p{IDContinue}/, "i =~ ID_Continue"
+}
+
+
-- 
1.6.2.4

@p5pRT
Author

p5pRT commented Jul 12, 2009

From @schwern

0004-Add-tests-for-similar-rt.cpan.org-56444.patch
From 5fc9a94f55277ea133571f6e2bc52019b7ed240b Mon Sep 17 00:00:00 2001
From: Michael G. Schwern <schwern@pobox.com>
Date: Sat, 11 Jul 2009 16:43:13 -0700
Subject: [PATCH 4/4] Add tests for similar rt.cpan.org 56444

---
 t/op/regexp_charnames.t |   27 +++++++++++++++++++++++++--
 1 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/t/op/regexp_charnames.t b/t/op/regexp_charnames.t
index e128671..b7d9b88 100644
--- a/t/op/regexp_charnames.t
+++ b/t/op/regexp_charnames.t
@@ -12,7 +12,7 @@ BEGIN {
     require "./test.pl";
 }
 
-plan tests => 64;
+plan tests => 67;
 
 
 sub may_not_warn {
@@ -93,14 +93,37 @@ sub may_not_warn {
 
 # rt.cpan.org 62056
 # Problem with a variable before a \N{...} in a pattern match
+# Regressions in 5.10.0 from 5.8.8.
 {
-    local $::TODO = "rt.cpan.org 62056";
+    use charnames ":full";
+
+    local $::TODO = "rt.cpan.org 62056 and 56444";
 
+    # rt.cpan.org 62056
     fresh_perl_is(<<'CODE', "Yes", {}, 'variable before \N{...}');
 use charnames ":full";
 $x = "";
 print "Yes" if "$x\N{LATIN CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/;
 CODE
+
+    fresh_perl_is(<<'CODE', "Yes", {}, 'variable before \N{...}');
+use charnames ":full";
+$x = "";
+print "Yes" if "$x\N{LATIN CAPITAL LETTER E}" =~ /$x \N{LATIN CAPITAL LETTER E}/x;
+CODE
+
+    # rt.cpan.org 56444
+    fresh_perl_is(<<'CODE', "Yes", {}, '\N{...}+' );
+use charnames ":full";
+my $r1 = qr/\N{THAI CHARACTER SARA I}/;
+my $s1 = "\N{THAI CHARACTER SARA I}" x 2;
+print "Yes" if $s1 =~ /$r1+/;
+CODE
+
+    fresh_perl_is(<<'CODE', "Yes", { switches => ['-w'], stderr => 1 });
+use charnames ":full";
+print "Yes" if "\N{THAI CHARACTER SARA I}" =~ /\N{THAI CHARACTER SARA I}+/;
+CODE
 }
 
 
-- 
1.6.2.4

@p5pRT
Author

p5pRT commented Jul 15, 2009

From @demerphq

2009/1/8 Moritz Lenz <moritz@​casella.verplant.org>​:

Elliot Shank wrote​:

# New Ticket Created by  Elliot Shank
# Please include the string​:  [perl #62056]
# in the subject line of all future correspondence about this issue.
# <URL​: http​://rt.perl.org/rt3/Ticket/Display.html?id=62056 >

This is a bug report for perl from perl@​galumph.com,
generated with the help of perlbug 1.36 running under perl 5.10.0.

-----------------------------------------------------------------
[Please enter your report here]

Putting a variable expansion into a regex with a \N{} escape
results in a compilation error in 5.10.0.  For example, the
following program compiles​:

    #!/usr/bin/env perl
    use charnames '​:full';
    m/\N{START OF HEADING}/

However, this

    #!/usr/bin/env perl
    use charnames '​:full';
    m/$x\N{START OF HEADING}/

results in

    Constant(\N{START OF HEADING}) unknown​: (possibly a missing
    "use charnames ...") in regex;

This worked in perl-5.8.8, and fails in perl-5.10.0.
So I bisected it, and this is what git-bisect says is the offending commit​:

fc8cd66 is first bad commit
commit fc8cd66
Author​: Yves Orton <demerphq@​gmail.com>
Date​:   Tue Sep 19 03​:37​:19 2006 +0200

   Re​: \N{...} in regular expression [PATCH]
   Message-ID​:
<9b18b3110609181637m796d6c16o1b2741edc5f09eb2@​mail.gmail.com>

   p4raw-id​: //depot/perl@​28868

I think the right solution to this problem is to fix charnames.

Making charnames lexically scoped poses serious conceptual
difficulties in the regex engine, for IMO very very little benefit.

IMO we should just make \N{} escapes work always. And disable this
silly "charnames not in scope" behaviour. At least in regex patterns.
I mean what do we gain?

Yes
--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Author

p5pRT commented Jul 15, 2009

From @schwern

demerphq wrote​:

I think the right solution to this problem is to fix charnames.

Making charnames lexically scoped poses serious conceptual
difficulties in the regex engine, for IMO very very little benefit.

IMO we should just make \N{} escapes work always. And disable this
silly "charnames not in scope" behaviour. At least in regex patterns.
I mean what do we gain?

(Note​: this is all by someone who doesn't really do much Unicode)

Oddly enough, this is a case where a wild stretch of backwards compatibility
was broken.

$ perl5.5.5 -wle 'print "\N{FOO}"'
N{FOO}

$ perl5.6.1 -wle 'print "\N{FOO}"'
Constant(\N{...}) unknown​: (possibly a missing "use charnames ...") at -e line
1, within string
Execution of -e aborted due to compilation errors.

BUT WHAT IF SOMEONE WAS USING "\N{BLAH}" IN THEIR CODE?!

*ahem*

So the argument that a lexical charnames is protecting against code which does
not use charnames is bogus since \N{...} is already globally broken without it.

Seems to me the problem is there's not just one charnames. There's lots of
them. :full, :short, :alias, greek, cyrillic... and you can even define your
own. How do you know which one is in use?

This comes down to how charnames works. There's not a big table somewhere,
you export a "translator" function... which probably looks at some big table
on disk. But it means only one translator can be in effect at any given time.
This seems to me like overkill.

On the one hand, who cares? It's not like it's bad if there are too many
charname symbols. Consider \N{...} a big namespace and leave it up to the
charnames authors to be polite and not clobber each other.

If someone writes a charnames extension I'd like to use why do I have to
exclude all the others?

  use Encode​::JP​::Mobile​::Charnames;
  use charnames "​:full";
  binmode STDOUT, "​:utf8";
  print "\N{DoCoMo Beer}\n\N{GREEK SMALL LETTER SIGMA}\n";
  __END__
  Unknown charname 'DoCoMo Beer' at
/usr/local/perl/5.10.0/lib/5.10.0/unicore/Name.pl line 1
  �
  σ

Case in point, Encode​::JP​::Mobile​::Charnames (the only custom charnames module
I could find on CPAN) works around this by falling back to
charnames​::charnames(). Of course this hack only works if I load it AFTER
charnames.

So in order to unlexicalize charnames an additive system would have to be put
in place. Perhaps something as simple as a list of translation functions to
try. Just keep trying until one works.
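
As an illustration only (not code from this thread or from charnames itself), the "list of translation functions" idea could be sketched roughly as below. The names `register_translator` and `translate_name` and the toy `'My Beer'` entry are hypothetical; the only real API used is `charnames::vianame`.

```perl
use strict;
use warnings;
require charnames;

# Hypothetical registry: translators are tried in registration order
# until one of them returns a defined string.
my @translators;

sub register_translator { push @translators, @_ }

sub translate_name {
    my ($name) = @_;
    for my $translator (@translators) {
        my $str = $translator->($name);
        return $str if defined $str;
    }
    return;    # no translator recognised the name
}

# A toy custom translator, standing in for a module such as
# Encode::JP::Mobile::Charnames. The name and code point are made up.
register_translator(sub {
    my %custom = ('My Beer' => "\x{1F37A}");
    return $custom{ $_[0] };
});

# Fall back to the standard Unicode names.
register_translator(sub {
    my $code = charnames::vianame($_[0]);
    return defined $code ? chr $code : undef;
});

printf "U+%04X\n", ord translate_name('My Beer');
printf "U+%04X\n", ord translate_name('GREEK SMALL LETTER SIGMA');
```

In a scheme like this the registration order decides which translator wins when two of them know the same name, so load order would still matter.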

--
You are wicked and wrong to have broken inside and peeked at the
implementation and then relied upon it.
  -- tchrist in <31832.969261130@​chthon>

@p5pRT
Author

p5pRT commented Jul 16, 2009

From ben@morrow.me.uk

Quoth schwern@​pobox.com (Michael G Schwern)​:

Seems to me the problem is there's not just one charnames. There's lots of
them. :full, :short, :alias, greek, cyrillic... and you can even define your
own. How do you know which one is in use?

This comes down to how charnames works. There's not a big table somewhere,
you export a "translator" function... which probably looks at some big table
on disk. But it means only one translator can be in effect at any given time.
This seems to me like overkill.

On the one hand, who cares? It's not like it's bad if there are too many
charname symbols. Consider \N{...} a big namespace and leave it up to the
charnames authors to be polite and not clobber each other.

But what about

  ~% perl -E'
  {use charnames "latin";
  say charnames​::viacode(ord "\N{upsilon}")}
  {use charnames "greek";
  say charnames​::viacode(ord "\N{upsilon}")}'
  LATIN SMALL LETTER UPSILON
  GREEK SMALL LETTER UPSILON

and other cases of conflict? One of the points of charnames is to avoid
having to say LATIN SMALL LETTER BLAH WITH MANY EXTRA VERY VERBOSE
DIACRITICALS all the time, in favour of shorter but ambiguous names​: it
would be a shame to lose that.

Ben

@p5pRT
Author

p5pRT commented Jul 17, 2009

From @rgs

2009/7/16 Ben Morrow <ben@​morrow.me.uk>​:

Quoth schwern@​pobox.com (Michael G Schwern)​:

Seems to me the problem is there's not just one charnames.  There's lots of
them.  :full, :short, :alias, greek, cyrillic... and you can even define your
own.  How do you know which one is in use?

This comes down to how charnames works.  There's not a big table somewhere,
you export a "translator" function... which probably looks at some big table
on disk.  But it means only one translator can be in effect at any given time.
 This seems to me like overkill.

On the one hand, who cares?  It's not like it's bad if there are too many
charname symbols.  Consider \N{...} a big namespace and leave it up to the
charnames authors to be polite and not clobber each other.

But what about

   ~% perl -E'
       {use charnames "latin";
           say charnames​::viacode(ord "\N{upsilon}")}
       {use charnames "greek";
           say charnames​::viacode(ord "\N{upsilon}")}'
   LATIN SMALL LETTER UPSILON
   GREEK SMALL LETTER UPSILON

and other cases of conflict? One of the points of charnames is to avoid
having to say LATIN SMALL LETTER BLAH WITH MANY EXTRA VERY VERBOSE
DIACRITICALS all the time, in favour of shorter but ambiguous names​: it
would be a shame to lose that.

Good point. I like Yves' suggestion, because it's simple. I think that
Karl was doing something about the ambiguous names you're pointing to,
but I might be wrong. Also​:

$ perl -E '
use charnames qw(greek latin);
say charnames​::viacode(ord "\N{upsilon}");
'
LATIN SMALL LETTER UPSILON

$ perl -E '
use charnames qw(latin greek);
say charnames​::viacode(ord "\N{upsilon}");
'
LATIN SMALL LETTER UPSILON

@p5pRT
Author

p5pRT commented Jul 17, 2009

From @rgs

2009/7/16 Michael G Schwern <schwern@​pobox.com>​:

Oddly enough, this is a case where a wild stretch of backwards compatibility
was broken.

$ perl5.5.5 -wle 'print "\N{FOO}"'
N{FOO}

$ perl5.6.1 -wle 'print "\N{FOO}"'
Constant(\N{...}) unknown​: (possibly a missing "use charnames ...") at -e line
1, within string
Execution of -e aborted due to compilation errors.

BUT WHAT IF SOMEONE WAS USING "\N{BLAH}" IN THEIR CODE?!

*ahem*

Seriously, this is getting a bit tiresome. I don't know where this myth
of P5P being opposed to any form of compatibility breakage originated,
but it's a myth. And not a very flattering one. No dragons are
slain in it. (or vampires)

Did you notice that I recently added a meaning for \N alone in regexes?
(The opposite of /\n/, for the record.) And what if someone was using
it before?

@p5pRT
Author

p5pRT commented Jul 17, 2009

From @Abigail

On Fri, Jul 17, 2009 at 10​:02​:32AM +0200, Rafael Garcia-Suarez wrote​:

2009/7/16 Michael G Schwern <schwern@​pobox.com>​:

Oddly enough, this is a case where a wild stretch of backwards compatibility
was broken.

$ perl5.5.5 -wle 'print "\N{FOO}"'
N{FOO}

$ perl5.6.1 -wle 'print "\N{FOO}"'
Constant(\N{...}) unknown​: (possibly a missing "use charnames ...") at -e line
1, within string
Execution of -e aborted due to compilation errors.

BUT WHAT IF SOMEONE WAS USING "\N{BLAH}" IN THEIR CODE?!

*ahem*

Seriously, this is getting a bit tiresome. I don't know where this myth
of P5P being opposed to any form of compatibility breakage originated,
but it's a myth. And not a very flattering one. No dragons are
slain in it. (or vampires)

Did you notice that I recently added a meaning for \N alone in regexes?
(The opposite of /\n/, for the record.) And what if someone was using
it before?

Well, in 5.10, \N without braces is actually an error​:

  $ perl -wE '"" =~ /\N/'
  Missing braces on \N{} in regex; marked by <-- HERE in m/\N <-- HERE / at -e line 1.
  $

as it is in 5.8.9 (and 5.6.2)​:

  $ /opt/perl/5.8.9/bin/perl -wle '"" =~ /\N/'
  Missing braces on \N{} at -e line 1, within pattern
  Execution of -e aborted due to compilation errors.
  $

You'd have to go back to the 5.005 era (that is, the previous century)
to be able to have a '\N' in your regexp, and having it mean 'N'.

Also note the line in perlrebackslash​:

  If the character following the backslash is a letter or a digit, then
  the sequence may be special; if so, it’s listed below. A few letters
  have not been used yet, and escaping them with a backslash is safe for
  now, but a future version of Perl may assign a special meaning to it.
  However, if you have warnings turned on, Perl will issue a warning if
  you use such a sequence. [1].

So, it's not that we aren't documenting the fact that a letter preceded
by a backslash may get a special meaning in a later version of Perl.

Perhaps the only people who may get bitten are the ones that have been
using 'N' as a regexp delimiter.

@p5pRT
Author

p5pRT commented Oct 17, 2009

From @demerphq

On Sat Jul 11 00​:56​:45 2009, schwern wrote​:

Bizarrely, it works in an eval.

$ perl5.10.0 -wle 'use charnames ":full"; my $x = ""; print "\N{LATIN CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/; print $@'
Constant(\N{LATIN CAPITAL LETTER E}) unknown: (possibly a missing
"use charnames ...") in regex; marked by <-- HERE in m/\N{LATIN CAPITAL
LETTER E} <-- HERE / at -e line 1.

$ perl5.10.0 -wle 'use charnames ":full"; my $x = ""; print eval q{"\N{LATIN CAPITAL LETTER E}" =~ /$x\N{LATIN CAPITAL LETTER E}/}; print $@'
1

The reason it works in the string eval is because the \N{...} is
interpolated and resolved before eval is called. In raw code the \N{} is
not interpolated before it is handed to the regex engine. Somehow the
regex engine is not seeing the charnames data.
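
A minimal sketch of the same effect without a string eval, assuming only what is described above: a double-quoted string resolves `\N{...}` when the code is compiled, so the resolved character can be interpolated into the pattern. The variable names are illustrative.

```perl
use strict;
use warnings;
use charnames ':full';

my $x = "";

# Double-quoted strings resolve \N{...} at compile time, so $e already
# holds the character itself by the time the pattern is built.
my $e = "\N{LATIN CAPITAL LETTER E}";

# \Q...\E keeps the character literal even if it happens to be a regex
# metacharacter (FULL STOP, for example).
my $matched = "\N{LATIN CAPITAL LETTER E}" =~ /$x\Q$e\E/ ? 1 : 0;
print "matched: $matched\n";
```

The pattern the regex engine sees then contains no `\N{...}` at all, so it does not need charnames at run time.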

@p5pRT
Author

p5pRT commented Oct 17, 2009

From @demerphq

The core problem here is that delayed evaluation of charnames directly
contradicts expectations of charnames behaviour. Specifically, delayed
evaluation may mean that different parts of a pattern are compiled with
different charnames associations.

I can see a few ways to handle this:

1. (really hard) Store the charnames in effect with each qr. Hack the
concatenation logic and regex compilation logic to be able to handle
different charnames associations for different subsections of the pattern.
2. (hard) Figure out better semantics for charnames and fix it
3. (moderate) Restore old early evaluation of charnames in regexes. This
has the downside that if you used something like \N{full stop} it would
be the same as putting a literal "." in your pattern, and have the same
magic side effects as "dot" does normally in a pattern. IMO this is not
desirable.
4. (easy?) Use charnames to resolve the character at toker/compile time,
but convert it to an \x{...} escape on the fly when storing it in the
regex pattern. That way it always expands to the right character later
on regardless of what charnames handlers are in scope at the time.

Option 4 seems to be clearly the best solution. I'm not too familiar with
the toker, but my guess is that it is probably fairly easy.
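
To make option 4 concrete, here is a rough sketch of the rewrite on a pattern string. It is pure-Perl illustration, not the toker change itself. The helper `names_to_hex` is hypothetical, and it deliberately ignores corner cases such as an escaped backslash before the `N`.

```perl
use strict;
use warnings;
require charnames;

# Hypothetical helper: rewrite each \N{NAME} in a pattern string as
# \x{HEX}. \N{U+...} forms and unknown names are left untouched.
sub names_to_hex {
    my ($pattern) = @_;
    $pattern =~ s/\\N\{([^}]+)\}/
        my $name = $1;
        my $code = $name =~ m{^U\+} ? undef : charnames::vianame($name);
        defined $code ? sprintf('\\x{%X}', $code) : '\\N{' . $name . '}';
    /ge;
    return $pattern;
}

my $converted = names_to_hex('a\N{FULL STOP}b');
print "$converted\n";

# The escape stays a literal full stop once compiled; it is not the
# "match anything" metacharacter.
my $re = qr/^$converted$/;
print "a.b matches\n"        if 'a.b' =~ $re;
print "axb does not match\n" if 'axb' !~ $re;
```

Because the name is resolved once, up front, the result no longer depends on which charnames handler is in scope when the pattern is finally compiled.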

@p5pRT
Author

p5pRT commented Feb 18, 2010

From @jbenjore

On Wed, Feb 17, 2010 at 3​:19 PM, karl williamson
<public@​khwilliamson.com> wrote​:

Another consideration is that in working on this patch I found two other
bugs which it fixes.  I don't know enough about security considerations to
know if one or both should be blockers.  I did not file bug reports, since
my patch fixed them; perhaps reports should be filed​:

Yes, please file the reports. I am pondering that perhaps they should
be fixed in maintenance branches of perl too.

Josh

@p5pRT
Author

p5pRT commented Feb 18, 2010

From @obra

On Thu 18.Feb'10 at 12​:18​:17 -0700, karl williamson wrote​:

Jesse Vincent wrote​:

On Thu 18.Feb'10 at 10​:10​:36 -0700, karl williamson wrote​:

I suggest the following​:

    \N not followed by a {      -> [^\n]
    \N{ but no following }      -> Error (as it does in 5.10)
    \N{...} && ... matches /^[0-9]+(?:,[0-9]*)?$/
                                -> [^\n]{...}
    Otherwise                   -> Pass to charnames.

This is essentially the approach my patch takes. The only
difference is that instead of generating an error as above, it
instead generates a warning and matches [^\n]. Obviously it could
be easily changed to do the error.
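
For illustration, the suggested rules can be written as a small classifier. This is a hypothetical sketch, not the code in toke.c: `classify_backslash_N` is an invented name, and it follows the "error" variant of the second rule rather than the warning the patch currently emits.

```perl
use strict;
use warnings;

# Hypothetical classifier. $rest is the pattern text immediately
# following a "\N".
sub classify_backslash_N {
    my ($rest) = @_;

    # No opening brace: \N on its own means "any non-newline".
    return 'non-newline' unless $rest =~ /^\{/;

    # Opening brace without a closing one.
    my ($inside) = $rest =~ /^\{([^}]*)\}/
        or return 'error: missing right brace';

    # The braces hold a legal quantifier such as {3}, {3,} or {3,5}.
    return 'quantified non-newline'
        if $inside =~ /^[0-9]+(?:,[0-9]*)?$/;

    # Anything else is handed to charnames.
    return 'charname';
}

for my $rest ('abc', '{3,5}x', '{PLUS SIGN}', '{PLUS') {
    printf "%-14s => %s\n", "\\N$rest", classify_backslash_N($rest);
}
```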

Let's make that change. Can you supply it as a patch on top of this
one? If there's a chance we're going to ship this in 5.12 (which you
make a very good case for), I'd like it to go out in this weekend's
blead release for more extensive testing.

I'm starting work on that now.

Thanks. I'll revise my request slightly. It looks like your previous
patch doesn't apply cleanly to blead anymore, as you and rgs both
patched regcurly. I don't want to screw up the merge. If it's plausible
to do a rollup of your previous two patches and the new change, that
would be awesome. If not, I'll figure something out.

What I'd like to do for 5.12 is to declare that \N meaning [^\n] is an
EXPERIMENTAL feature. If we're happy with it in a year, we can remove
that warning. If we're not, we can deprecate it away and it'll just
be a bad memory in 5.16.

I think you mean by 'declare' to do so only in the documentation.
If you mean code changes, please clarify.

Correct. I mean notes in perldelta and perlre.

And I also claim that 5.12 should deprecate non-reasonable character
names no matter what else is decided.

What would that change look like?

Just a call to the warner if the name doesn't look like what we
expect. I'll add it as a separate commit, based on a preliminary
definition of "what we expect".

Great. Thank you.

@p5pRT
Author

p5pRT commented Feb 18, 2010

From @druud62

Eric Brine wrote​:

How does $x="\\N"; /$x{...}/ fit in there? [^\n] or "Charnames and you're
silly not to escape your brackets"?

I think it should be [^\n] in that case.

Of course that will lead to a popular line of code​:

  my $N = '\\N';

at the start of many Perl programs.

--
Ruud

@p5pRT
Author

p5pRT commented Feb 18, 2010

From @khwilliamson

Attached are two patches. The first is essentially a rebase of the
previously submitted patch to the latest blead, plus some additional
tests. This means it takes into account Rafael's patch to cause
charnames not to be called if there is an error (which prevents it from
working properly), but since all that parsing has been moved to toke.c,
it does it differently.

The second patch turns the warning for a '\N{' without a matching
right brace into a fatal error.

@p5pRT
Author

p5pRT commented Feb 18, 2010

From @khwilliamson

0001-PATCH-perl-56444-delayed-interpolation-of-N.patch
From 16a1ae2d48d9451c4f5baf1e804ff5db1d9085f6 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@khw-desktop.(none)>
Date: Thu, 18 Feb 2010 13:41:09 -0700
Subject: [PATCH] PATCH: [perl #56444] delayed interpolation of \N{...}

make regen embed.fnc
needs to be run on this patch.

This patch fixes Bugs #56444 and #62056.

Hopefully we have finally gotten this right.  The parser used to handle
all the escaped constants, expanding \x2e to its single byte equivalent.
The problem is that for regexp patterns, this is a '.', which is a
metacharacter and has special meaning that \x2e does not.  So things
were changed so that the parser didn't expand things in patterns.  But
this causes problems for \N{NAME}, when the pattern doesn't get
evaluated until runtime, as for example when it has a scalar reference
in it, like qr/$foo\N{NAME}/.  We want the value for \N{NAME} that was
in effect at the point during the parsing phase that this regex was
encountered in, but we don't actually look at it until runtime, when
these bug reports show that it is gone.  The solution is for the
tokenizer to parse \N{NAME}, but to compile it into an intermediate
value that won't ever be considered a metacharacter.  We have chosen to
compile NAME to its equivalent code point value, and express it in the
already existing \N{U+...} form.  This indicates to the regex compiler
that the original input was a named character and retains the value it
had at that point in the parse.

This means that \N{U+...} now always must imply Unicode semantics for
the string or pattern it appeared in.  Previously there was an
inconsistency, where effectively \N{NAME} implied Unicode semantics, but
\N{U+...} did not necessarily.  So now, any string or pattern that has
either of these forms is utf8 upgraded.

A complication is that a charnames handler can return a sequence of
multiple characters instead of just one.  To deal with this case, the
tokenizer will generate a constant of the form \N{U+c1.c2.c3...}, where
c1 etc are the individual characters.  Perhaps this will be made a
public interface someday, but I decided to not expose it externally as
far as possible for now in case we find reason to change it.  It is
possible to defeat this by passing it in a single quoted string to the
regex compiler, so the documentation will be changed to discourage that.

A further complication is that \N can have an additional meaning: to
match a non-newline.  This means that the two meanings have to be
disambiguated.

embed.fnc was changed to make public the function regcurly() in
regcomp.c so that it could be referred to in toke.c to see if the ... in
\N{...} is a legal quantifier like {2,}.  This is used in the
disambiguation.

toke.c was changed to update some out-dated relevant comments.
It now parses \N in patterns.  If it determines that it isn't a named
sequence, it passes it through unchanged.  This happens when there is no
brace after the \N, or no closing brace, or if the braces enclose a
legal quantifier.  Previously there has been essentially no restriction
on what can come between the braces so that a custom translator can
accept virtually anything.  Now, legal quantifiers are assumed to mean
that the \N is a "match non-newline that quantity of times".

I removed the #ifdef'd out code that had been left in in case pack U
reverted to earlier behavior.  I did this because it complicated things,
and because the change to pack U has been in long enough and shown that
it is correct so it's not likely to be reverted.

\N meaning a named character is handled differently depending on whether
this is a pattern or not.  In all cases, the output will be upgraded to
utf8 because a named character implies Unicode semantics.  If not a
pattern, the \N is parsed into a utf8 string, as before.  Otherwise it
will be parsed into the intermediate \N{U+...} form.  If the original
was already a valid \N{U+...} constant, it is passed through unchanged.

I now check that the sequence returned by the charnames handler is not
malformed, which was lacking before.

The code in regcomp.c which dealt with interfacing with the charnames
handler has been removed.  All the values should be determined by the
time regcomp.c gets involved.  The affected subroutine is necessarily
restructured.

An EXACT-type node is generated for the character sequence.  Such a node
has a capacity of 255 bytes, and so it is possible to overflow it.  This
wasn't checked for before, but now it is: a warning is issued and the
overflowing characters are discarded.
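
The overflow handling can be sketched as follows.  This is a simplified
standalone illustration, not the regcomp.c code: the names `parse_hex`,
`utf8_len` and `count_fitting` are hypothetical, folding is ignored, and
code points above U+10FFFF are treated as invalid purely to keep the
sketch short.  As in the patch, an unparsable element is replaced by
U+FFFD.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NODE_MAX_BYTES   255        /* capacity of one EXACT-type node */
#define REPLACEMENT_CHAR 0xFFFDUL   /* U+FFFD REPLACEMENT CHARACTER */

/* Parses the hex digits in [p, stop).  Returns 1 and stores the value in
 * *cp on success; returns 0 on an empty or malformed number. */
static int parse_hex(const char *p, const char *stop, unsigned long *cp)
{
    unsigned long v = 0;
    if (p == stop) return 0;
    for (; p < stop; p++) {
        int d;
        if (*p >= '0' && *p <= '9')      d = *p - '0';
        else if (*p >= 'a' && *p <= 'f') d = *p - 'a' + 10;
        else if (*p >= 'A' && *p <= 'F') d = *p - 'A' + 10;
        else return 0;
        v = v * 16 + (unsigned long)d;
        if (v > 0x10FFFFUL) return 0;    /* simplification of this sketch */
    }
    *cp = v;
    return 1;
}

/* Bytes needed to encode cp in UTF-8, for cp <= U+10FFFF. */
static size_t utf8_len(unsigned long cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

/* body points just past "U+" in "\N{U+c1.c2...}" and end points at the
 * closing '}'.  Returns the number of bytes that fit into one node, sets
 * *n_kept to the number of code points kept, and sets *truncated to 1 if
 * trailing code points had to be dropped. */
static size_t count_fitting(const char *body, const char *end,
                            size_t *n_kept, int *truncated)
{
    size_t bytes = 0;
    const char *p = body;

    assert(body && end && body <= end && n_kept && truncated);
    *n_kept = 0;
    *truncated = 0;

    while (p < end) {
        const char *dot = memchr(p, '.', (size_t)(end - p));
        const char *stop = dot ? dot : end;
        unsigned long cp;

        if (!parse_hex(p, stop, &cp)) cp = REPLACEMENT_CHAR;

        /* Quit before adding this character if it would exceed the limit. */
        if (bytes + utf8_len(cp) > NODE_MAX_BYTES) {
            *truncated = 1;
            break;
        }
        bytes += utf8_len(cp);
        (*n_kept)++;
        p = stop + 1;   /* step over the '.' (or past the '}') */
    }
    return bytes;
}
```

In this model a sequence of 255 `A` characters fits exactly, while a 256th
is dropped and reported through `*truncated`, which corresponds to the
point at which the patch issues its warning.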
---
 embed.fnc             |    2 +-
 pod/perl5120delta.pod |   23 +---
 pod/perldiag.pod      |   67 +++++++-
 regcomp.c             |  453 +++++++++++++++++++++---------------------------
 t/lib/Cname.pm        |   20 +++
 t/re/pat.t            |   23 +++-
 t/re/pat_advanced.t   |   32 +++-
 t/re/re_tests         |   25 +++
 t/re/regexp.t         |    4 +
 toke.c                |  384 ++++++++++++++++++++++++++++++++---------
 10 files changed, 659 insertions(+), 374 deletions(-)

diff --git a/embed.fnc b/embed.fnc
index 769481b..9ccd663 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -165,6 +165,7 @@ npR	|MEM_SIZE|malloc_good_size	|size_t nbytes
 
 AnpR	|void*	|get_context
 Anp	|void	|set_context	|NN void *t
+EpRnP	|I32	|regcurly	|NN const char *s
 
 END_EXTERN_C
 
@@ -1706,7 +1707,6 @@ Es	|regnode*|regbranch	|NN struct RExC_state_t *pRExC_state \
 Es	|STRLEN	|reguni		|NN const struct RExC_state_t *pRExC_state \
 				|UV uv|NN char *s
 Es	|regnode*|regclass	|NN struct RExC_state_t *pRExC_state|U32 depth
-ERsn	|I32	|regcurly	|NN const char *s
 Es	|regnode*|reg_node	|NN struct RExC_state_t *pRExC_state|U8 op
 Es	|UV	|reg_recode	|const char value|NN SV **encp
 Es	|regnode*|regpiece	|NN struct RExC_state_t *pRExC_state \
diff --git a/pod/perl5120delta.pod b/pod/perl5120delta.pod
index aebeedf..47304ff 100644
--- a/pod/perl5120delta.pod
+++ b/pod/perl5120delta.pod
@@ -237,9 +237,10 @@ for some or all operations. (Yuval Kogman)
 
 A new regex escape has been added, C<\N>. It will match any character that
 is not a newline, independently from the presence or absence of the single
-line match modifier C</s>. (If C<\N> is followed by an opening brace and
+line match modifier C</s>.  It is not usable within a character class.
+(If C<\N> is followed by an opening brace and
 by a letter, perl will still assume that a Unicode character name is
-coming, so compatibility is preserved.) (Rafael Garcia-Suarez)
+coming, so compatibility is preserved.) (Rafael Garcia-Suarez).
 
 This will break a L<custom charnames translator|charnames/CUSTOM TRANSLATORS>
 which allows numbers for character names, as C<\N{3}> will now mean to match 3
@@ -2464,24 +2465,6 @@ take a block as their first argument, like
 
 =item *
 
-The C<charnames> pragma may generate a run-time error when a regex is
-interpolated [RT #56444]:
-
-    use charnames ':full';
-    my $r1 = qr/\N{THAI CHARACTER SARA I}/;
-    "foo" =~ $r1;    # okay
-    "foo" =~ /$r1+/; # runtime error
-
-A workaround is to generate the character outside of the regex:
-
-    my $a = "\N{THAI CHARACTER SARA I}";
-    my $r1 = qr/$a/;
-
-However, C<$r1> must be used within the scope of the C<use charnames> for this
-to work.
-
-=item *
-
 Some regexes may run much more slowly when run in a child thread compared
 with the thread the pattern was compiled into [RT #55600].
 
diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index aeb5d27..486a515 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -1912,10 +1912,10 @@ about 250 characters for simple names, and somewhat more for compound
 names (like C<$A::B>).  You've exceeded Perl's limits.  Future versions
 of Perl are likely to eliminate these arbitrary limitations.
 
-=item Ignoring %s in character class in regex; marked by <-- HERE in m/%s/
+=item Ignoring zero length \N{} in character class
 
-(W) Named Unicode character escapes (\N{...}) may return multi-char
-or zero length sequences. When such an escape is used in a character class
+(W) Named Unicode character escapes (\N{...}) may return a
+zero length sequence.  When such an escape is used in a character class
 its behaviour is not well defined. Check that the correct escape has
 been used, and the correct charname handler is in scope.
 
@@ -2395,6 +2395,10 @@ See also L<Encode/"Handling Malformed Data">.
 (F) Perl thought it was reading UTF-16 encoded character data but while
 doing it Perl met a malformed Unicode surrogate.
 
+=item Malformed UTF-8 returned by \N
+
+(F) The charnames handler returned malformed UTF-8.
+
 =item Malformed UTF-8 string in pack
 
 (F) You tried to pack something that didn't comply with UTF-8 encoding
@@ -2467,7 +2471,7 @@ supplied.
 (F) The argument to the indicated command line switch must follow
 immediately after the switch, without intervening spaces.
 
-=item Missing %sbrace%s on \N{}
+=item Missing braces on \N{}
 
 (F) Wrong syntax of character name literal C<\N{charname}> within
 double-quotish context.
@@ -2506,7 +2510,34 @@ can vary from one line to the next.
 
 =item Missing right brace on %s
 
-(F) Missing right brace in C<\x{...}>, C<\p{...}> or C<\P{...}>.
+(F) Missing right brace in C<\x{...}>, C<\p{...}>, C<\P{...}>, or C<\N{...}>.
+
+=item Missing right brace on \\N{} or unescaped left brace after \\N.  Assuming the latter
+
+(W syntax)
+C<\N> has traditionally been followed by a name enclosed in braces,
+meaning the character (or sequence of characters) given by that name.
+Thus C<\N{ASTERISK}> is another way of writing C<*>, valid in both
+double-quoted strings and regular expression patterns.
+In patterns, it doesn't have the meaning an unescaped C<*> does.
+
+Starting in Perl 5.12.0, C<\N> also can have an additional meaning in patterns,
+namely to match a non-newline character.  (This is like C<.> but is not
+affected by the C</s> modifier.)
+
+This can lead to some ambiguities.  When C<\N> is not followed immediately by a
+left brace, Perl assumes the "match non-newline character" meaning.  Also, if
+the braces form a valid quantifier such as C<\N{3}> or C<\N{5,}>, Perl assumes
+that this means to match the given quantity of non-newlines (in these examples,
+3, and 5 or more, respectively).  In all other cases, where there is a C<\N{>
+and a matching C<}>, Perl assumes that a character name is desired.
+
+However, if there is no matching C<}>, Perl doesn't know if it was mistakenly
+omitted, or if "match non-newline" followed by "match a C<{>" was desired.
+It assumes the latter because that is actually a valid interpretation as
+written, unlike the other case.  If you meant the former, you need to add the
+matching right brace.  If you did mean the latter, you can silence this warning
+by writing instead C<\N\{>.
 
 =item Missing right curly or square bracket
 
@@ -2593,6 +2624,13 @@ that yet.
 sense to try to declare one with a package qualifier on the front.  Use
 local() if you want to localize a package variable.
 
+=item \\N in a character class must be a named character: \\N{...}
+
+The new (5.12) meaning of C<\N> to match non-newlines is not valid in a
+bracketed character class, for the same reason that C<.> in a character class
+loses its specialness: it matches almost everything, which is probably not what
+you want.
+
 =item Name "%s::%s" used only once: possible typo
 
 (W once) Typographical errors often show up as unique variable names.
@@ -2605,6 +2643,11 @@ NOTE: This warning detects symbols that have been used only once so $c, @c,
 the same; if a program uses $c only once but also uses any of the others it
 will not trigger this warning.
 
+=item Invalid hexadecimal number in \\N{U+...}
+
+(F) The character constant represented by C<...> is not a valid hexadecimal
+number.
+
 =item Negative '/' count in unpack
 
 (F) The length count obtained from a length/code unpack operation was
@@ -4943,6 +4986,20 @@ C<< @foo->[23] >> or C<< @$ref->[99] >>.  Versions of perl <= 5.6.1 used to
 allow this syntax, but shouldn't have. It is now deprecated, and will be
 removed in a future version.
 
+=item Using just the first character returned by \N{} in character class
+
+(W) A charnames handler may return a sequence of more than one character.
+Currently all but the first one are discarded when used in a regular
+expression pattern bracketed character class.
+
+=item Using just the first characters returned by \N{}
+
+(W) A charnames handler may return a sequence of characters.  There is a finite
+limit as to the number of characters that can be used, which this sequence
+exceeded.  In the message, the characters in the sequence are separated by
+dots, and each is shown by its ordinal in hex.  Anything to the left of the
+C<HERE> was retained; anything to the right was discarded.
+
 =item UTF-16 surrogate %s
 
 (W utf8) You tried to generate half of a UTF-16 surrogate by
diff --git a/regcomp.c b/regcomp.c
index 6669d58..ce4104a 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -132,7 +132,6 @@ typedef struct RExC_state_t {
     I32		orig_utf8;	/* whether the pattern was originally in utf8 */
 				/* XXX use this for future optimisation of case
 				 * where pattern must be upgraded to utf8. */
-    HV		*charnames;		/* cache of named sequences */
     HV		*paren_names;		/* Paren names */
     
     regnode	**recurse;		/* Recurse regops */
@@ -177,7 +176,6 @@ typedef struct RExC_state_t {
 #define RExC_seen_evals	(pRExC_state->seen_evals)
 #define RExC_utf8	(pRExC_state->utf8)
 #define RExC_orig_utf8	(pRExC_state->orig_utf8)
-#define RExC_charnames  (pRExC_state->charnames)
 #define RExC_open_parens	(pRExC_state->open_parens)
 #define RExC_close_parens	(pRExC_state->close_parens)
 #define RExC_opend	(pRExC_state->opend)
@@ -4268,7 +4266,6 @@ redo_first_pass:
     RExC_size = 0L;
     RExC_emit = &PL_regdummy;
     RExC_whilem_seen = 0;
-    RExC_charnames = NULL;
     RExC_open_parens = NULL;
     RExC_close_parens = NULL;
     RExC_opend = NULL;
@@ -6589,56 +6586,69 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
    recognized '\N' and needs to handle the rest. RExC_parse is
    expected to point at the first char following the N at the time
    of the call.
+
+   The \N may be inside (indicated by valuep not being NULL) or outside a
+   character class.
+
+   \N may begin either a named sequence, or if outside a character class, mean
+   to match a non-newline.  For non single-quoted regexes, the tokenizer has
+   attempted to decide which, and in the case of a named sequence converted it
+   into one of the forms: \N{} (if the sequence is null), or \N{U+c1.c2...},
+   where c1... are the characters in the sequence.  For single-quoted regexes,
+   the tokenizer passes the \N sequence through unchanged; this code will not
+   attempt to determine this nor expand those.  The net effect is that if the
+   beginning of the passed-in pattern isn't '{U+' or there is no '}', it
+   signals that this \N occurrence means to match a non-newline.
+   
+   Only the \N{U+...} form should occur in a character class, for the same
+   reason that '.' inside a character class means to just match a period: it
+   just doesn't make sense.
    
    If valuep is non-null then it is assumed that we are parsing inside 
    of a charclass definition and the first codepoint in the resolved
    string is returned via *valuep and the routine will return NULL. 
    In this mode if a multichar string is returned from the charnames 
-   handler a warning will be issued, and only the first char in the 
+   handler, a warning will be issued, and only the first char in the 
    sequence will be examined. If the string returned is zero length
    then the value of *valuep is undefined and NON-NULL will 
    be returned to indicate failure. (This will NOT be a valid pointer 
    to a regnode.)
    
-   If valuep is null then it is assumed that we are parsing normal text
-   and inserts a new EXACT node into the program containing the resolved
-   string and returns a pointer to the new node. If the string is 
-   zerolength a NOTHING node is emitted.
+   If valuep is null then it is assumed that we are parsing normal text and a
+   new EXACT node is inserted into the program containing the resolved string,
+   and a pointer to the new node is returned.  But if the string is zero length
+   a NOTHING node is emitted instead.
 
    On success RExC_parse is set to the char following the endbrace.
-   Parsing failures will generate a fatal errorvia vFAIL(...)
-   
-   NOTE: We cache all results from the charnames handler locally in 
-   the RExC_charnames hash (created on first use) to prevent a charnames 
-   handler from playing silly-buggers and returning a short string and 
-   then a long string for a given pattern. Since the regexp program 
-   size is calculated during an initial parse this would result
-   in a buffer overrun so we cache to prevent the charname result from
-   changing during the course of the parse.
-   
+   Parsing failures will generate a fatal error via vFAIL(...)
  */
 STATIC regnode *
 S_reg_namedseq(pTHX_ RExC_state_t *pRExC_state, UV *valuep, I32 *flagp)
 {
-    char * name;        /* start of the content of the name */
     char * endbrace;    /* endbrace following the name */
-    SV *sv_str = NULL;  
-    SV *sv_name = NULL;
-    STRLEN len; /* this has various purposes throughout the code */
-    bool cached = 0; /* if this is true then we shouldn't refcount dev sv_str */
     regnode *ret = NULL;
+#ifdef DEBUGGING
+    char* parse_start = RExC_parse - 2;	    /* points to the '\N' */
+#endif
+
+    GET_RE_DEBUG_FLAGS_DECL;
  
     PERL_ARGS_ASSERT_REG_NAMEDSEQ;
+
+    GET_RE_DEBUG_FLAGS;
    
-    if (*RExC_parse != '{' ||
-	    (*RExC_parse == '{' && RExC_parse[1]
-	     && strchr("0123456789", RExC_parse[1])))
+    /* Disambiguate between \N meaning a named character versus \N meaning
+     * don't match a newline. */
+    if (*RExC_parse != '{'
+	|| (! (endbrace = strchr(RExC_parse, '}'))) /* no trailing brace */
+	|| ! (endbrace == RExC_parse + 1	/* nothing between the {} */
+	      || (endbrace - RExC_parse > 3	/* U+ and at least one hex */
+		  && strnEQ(RExC_parse + 1, "U+", 2))))
     {
-	GET_RE_DEBUG_FLAGS_DECL;
-	if (valuep)
+	if (valuep) {
 	    /* no bare \N in a charclass */
-	    vFAIL("Missing braces on \\N{}");
-	GET_RE_DEBUG_FLAGS;
+	    vFAIL("\\N in a character class must be a named character: \\N{...}");
+	}
 	nextchar(pRExC_state);
 	ret = reg_node(pRExC_state, REG_ANY);
 	*flagp |= HASWIDTH|SIMPLE;
@@ -6647,235 +6657,168 @@ S_reg_namedseq(pTHX_ RExC_state_t *pRExC_state, UV *valuep, I32 *flagp)
         Set_Node_Length(ret, 1); /* MJD */
 	return ret;
     }
-    name = RExC_parse+1;
-    endbrace = strchr(RExC_parse, '}');
-    if ( ! endbrace ) {
-        RExC_parse++;
-        vFAIL("Missing right brace on \\N{}");
-    } 
-    RExC_parse = endbrace + 1;  
-    
-    
-    /* RExC_parse points at the beginning brace, 
-       endbrace points at the last */
-    if ( name[0]=='U' && name[1]=='+' ) {
-        /* its a "Unicode hex" notation {U+89AB} */
-        I32 fl = PERL_SCAN_ALLOW_UNDERSCORES
-            | PERL_SCAN_DISALLOW_PREFIX
-            | (SIZE_ONLY ? PERL_SCAN_SILENT_ILLDIGIT : 0);
-        UV cp;
-        len = (STRLEN)(endbrace - name - 2);
-        cp = grok_hex(name + 2, &len, &fl, NULL);
-        if ( len != (STRLEN)(endbrace - name - 2) ) {
-            cp = 0xFFFD;
-        }    
-        if ( valuep ) {
-	    if (cp > 0xff) RExC_utf8 = 1;
-            *valuep = cp;
-            return NULL;
-        }
 
-	/* Need to convert to utf8 if either: won't fit into a byte, or the re
-	 * is going to be in utf8 and the representation changes under utf8. */
-	if (cp > 0xff || (RExC_utf8 && ! UNI_IS_INVARIANT(cp))) {
-	    U8 string[UTF8_MAXBYTES+1];
-	    U8 *tmps;
-	    RExC_utf8 = 1;
-	    tmps = uvuni_to_utf8(string, cp);
-	    sv_str = newSVpvn_utf8((char*)string, tmps - string, TRUE);
-	} else {    /* Otherwise, no need for utf8, can skip that step */
-	    char string;
-	    string = (char)cp;
-	    sv_str= newSVpvn(&string, 1);
+    /* Here, we have decided it is a named sequence */
+    RExC_parse++;	/* Skip past the '{' */
+    if (endbrace == RExC_parse) {   /* empty: \N{} */
+	if (! valuep) {
+	    RExC_parse = endbrace + 1;  
+	    return reg_node(pRExC_state,NOTHING);
 	}
-    } else {
-        /* fetch the charnames handler for this scope */
-        HV * const table = GvHV(PL_hintgv);
-        SV **cvp= table ? 
-            hv_fetchs(table, "charnames", FALSE) :
-            NULL;
-        SV *cv= cvp ? *cvp : NULL;
-        HE *he_str;
-        int count;
-        /* create an SV with the name as argument */
-        sv_name = newSVpvn(name, endbrace - name);
-        
-        if (!table || !(PL_hints & HINT_LOCALIZE_HH)) {
-            vFAIL2("Constant(\\N{%" SVf "}) unknown: "
-                  "(possibly a missing \"use charnames ...\")",
-                  SVfARG(sv_name));
-        }
-        if (!cvp || !SvOK(*cvp)) { /* when $^H{charnames} = undef; */
-            vFAIL2("Constant(\\N{%" SVf "}): "
-                  "$^H{charnames} is not defined", SVfARG(sv_name));
-        }
-        
-        
-        
-        if (!RExC_charnames) {
-            /* make sure our cache is allocated */
-            RExC_charnames = newHV();
-            sv_2mortal(MUTABLE_SV(RExC_charnames));
-        } 
-            /* see if we have looked this one up before */
-        he_str = hv_fetch_ent( RExC_charnames, sv_name, 0, 0 );
-        if ( he_str ) {
-            sv_str = HeVAL(he_str);
-            cached = 1;
-	} else if (PL_parser && PL_parser->error_count > 0) {
-	    /* Don't attempt to load charnames if we're already in error */
-	    vFAIL("Too many errors, cannot continue parsing");
-        } else {
-            dSP ;
 
-            ENTER ;
-            SAVETMPS ;
-            PUSHMARK(SP) ;
-            
-            XPUSHs(sv_name);
-            
-            PUTBACK ;
-            
-            count= call_sv(cv, G_SCALAR);
-            SPAGAIN ;
-            
-            if (count == 1) { /* XXXX is this right? dmq */
-                sv_str = POPs;
-                SvREFCNT_inc_simple_void(sv_str);
-            } 
-            
-            PUTBACK ;
-            FREETMPS ;
-            LEAVE ;
-            
-            if ( !sv_str || !SvOK(sv_str) ) {
-                vFAIL2("Constant(\\N{%" SVf "}): Call to &{$^H{charnames}} "
-		       "did not return a defined value", SVfARG(sv_name));
-            }
-            if (hv_store_ent( RExC_charnames, sv_name, sv_str, 0))
-                cached = 1;
-        }
+	if (SIZE_ONLY) {
+	    ckWARNreg(RExC_parse,
+		    "Ignoring zero length \\N{} in character class"
+	    );
+	    RExC_parse = endbrace + 1;  
+	}
+	*valuep = 0;
+	return (regnode *) &RExC_parse; /* Invalid regnode pointer */
     }
-    if (valuep) {
-        char *p = SvPV(sv_str, len);
-        if (len) {
-            STRLEN numlen = 1;
-            if ( SvUTF8(sv_str) ) {
-                *valuep = utf8_to_uvchr((U8*)p, &numlen);
-                if (*valuep > 0x7F)
-                    RExC_utf8 = 1; 
-                /* XXXX
-                  We have to turn on utf8 for high bit chars otherwise
-                  we get failures with
-                  
-                   "ss" =~ /[\N{LATIN SMALL LETTER SHARP S}]/i
-                   "SS" =~ /[\N{LATIN SMALL LETTER SHARP S}]/i
-                
-                  This is different from what \x{} would do with the same
-                  codepoint, where the condition is > 0xFF.
-                  - dmq
-                */
-                
-                
-            } else {
-                *valuep = (UV)*p;
-                /* warn if we havent used the whole string? */
-            }
-            if (numlen<len && SIZE_ONLY) {
-                ckWARN2reg(RExC_parse,
-			   "Ignoring excess chars from \\N{%" SVf "} in character class",
-			   SVfARG(sv_name)
-                );
-            }        
-        } else if (SIZE_ONLY) {
-            ckWARN2reg(RExC_parse,
-		       "Ignoring zero length \\N{%" SVf "} in character class",
-		       SVfARG(sv_name)
-                );
-        }
-        SvREFCNT_dec(sv_name);
-        if (!cached)
-            SvREFCNT_dec(sv_str);    
-        return len ? NULL : (regnode *)&len;
-    } else if(SvCUR(sv_str)) {     
-        
-        char *s; 
-        char *p, *pend;        
-        STRLEN charlen = 1;
-#ifdef DEBUGGING
-        char * parse_start = name-3; /* needed for the offsets */
-#endif
-        GET_RE_DEBUG_FLAGS_DECL;     /* needed for the offsets */
-        
-        ret = reg_node(pRExC_state,
-            (U8)(FOLD ? (LOC ? EXACTFL : EXACTF) : EXACT));
-        s= STRING(ret);
-        
-        if ( RExC_utf8 && !SvUTF8(sv_str) ) {
-            sv_utf8_upgrade(sv_str);
-        } else if ( !RExC_utf8 && SvUTF8(sv_str) ) {
-            RExC_utf8= 1;
-        }
-        
-        p = SvPV(sv_str, len);
-        pend = p + len;
-        /* len is the length written, charlen is the size the char read */
-        for ( len = 0; p < pend; p += charlen ) {
-            if (UTF) {
-                UV uvc = utf8_to_uvchr((U8*)p, &charlen);
-                if (FOLD) {
-                    STRLEN foldlen,numlen;
-                    U8 tmpbuf[UTF8_MAXBYTES_CASE+1], *foldbuf;
-                    uvc = toFOLD_uni(uvc, tmpbuf, &foldlen);
-                    /* Emit all the Unicode characters. */
-                    
-                    for (foldbuf = tmpbuf;
-                        foldlen;
-                        foldlen -= numlen) 
-                    {
-                        uvc = utf8_to_uvchr(foldbuf, &numlen);
-                        if (numlen > 0) {
-                            const STRLEN unilen = reguni(pRExC_state, uvc, s);
-                            s       += unilen;
-                            len     += unilen;
-                            /* In EBCDIC the numlen
-                            * and unilen can differ. */
-                            foldbuf += numlen;
-                            if (numlen >= foldlen)
-                                break;
-                        }
-                        else
-                            break; /* "Can't happen." */
-                    }                          
-                } else {
-                    const STRLEN unilen = reguni(pRExC_state, uvc, s);
-        	    if (unilen > 0) {
-        	       s   += unilen;
-        	       len += unilen;
-        	    }
-        	}
-	    } else {
-                len++;
-                REGC(*p, s++);
-            }
-        }
-        if (SIZE_ONLY) {
-            RExC_size += STR_SZ(len);
-        } else {
-            STR_LEN(ret) = len;
-            RExC_emit += STR_SZ(len);
-        }
-        Set_Node_Cur_Length(ret); /* MJD */
-        RExC_parse--; 
-        nextchar(pRExC_state);
-    } else {	/* zero length */
-        ret = reg_node(pRExC_state,NOTHING);
+
+    RExC_utf8 = 1;	/* named sequences imply Unicode semantics */
+    RExC_parse += 2;	/* Skip past the 'U+' */
+
+    if (valuep) {   /* In a bracketed char class */
+	/* We only pay attention to the first char of 
+	multichar strings being returned. I kinda wonder
+	if this makes sense as it does change the behaviour
+	from earlier versions, OTOH that behaviour was broken
+	as well. XXX Solution is to recharacterize as
+	[rest-of-class]|multi1|multi2... */
+
+	STRLEN length_of_hex;
+	I32 flags = PERL_SCAN_ALLOW_UNDERSCORES
+	    | PERL_SCAN_DISALLOW_PREFIX
+	    | (SIZE_ONLY ? PERL_SCAN_SILENT_ILLDIGIT : 0);
+    
+	char * endchar = strchr(RExC_parse, '.');
+	if (endchar) {
+	    ckWARNreg(endchar, "Using just the first character returned by \\N{} in character class");
+	}
+	else endchar = endbrace;
+
+	length_of_hex = (STRLEN)(endchar - RExC_parse);
+	*valuep = grok_hex(RExC_parse, &length_of_hex, &flags, NULL);
+
+	/* The tokenizer should have guaranteed validity, but it's possible to
+	 * bypass it by using single quoting, so check */
+	if ( length_of_hex != (STRLEN)(endchar - RExC_parse) ) {
+	    *valuep = UNICODE_REPLACEMENT;
+	}    
+
+	RExC_parse = endbrace + 1;
+	if (endchar == endbrace) return NULL;
+
+        ret = (regnode *) &RExC_parse;	/* Invalid regnode pointer */
     }
-    SvREFCNT_dec(sv_name);
-    if (!cached)
-        SvREFCNT_dec(sv_str);
-    return ret;
+    else {	/* Not a char class */
+	char *s;	    /* String to put in generated EXACT node */
+	STRLEN len = 0;	    /* Its current length */
+	char *endchar;	    /* Points to '.' or '}' ending cur char in the input
+			       stream */
+
+	ret = reg_node(pRExC_state,
+			(U8)(FOLD ? (LOC ? EXACTFL : EXACTF) : EXACT));
+	s= STRING(ret);
+
+	/* Exact nodes can hold only a U8 length's of text = 255.  Loop through
+	 * the input which is of the form now 'c1.c2.c3...}' until find the
+	 * ending brace or exceed length 255.  The characters that exceed this
+	 * limit are dropped.  The limit could be relaxed should it become
+	 * desirable by reparsing this as (?:\N{NAME}), so could generate
+	 * multiple EXACT nodes, as is done for just regular input.  But this
+	 * is primarily a named character, and not intended to be a huge long
+	 * string, so 255 bytes should be good enough */
+	while (1) {
+	    STRLEN this_char_length;
+	    I32 grok_flags = PERL_SCAN_ALLOW_UNDERSCORES
+			    | PERL_SCAN_DISALLOW_PREFIX
+			    | (SIZE_ONLY ? PERL_SCAN_SILENT_ILLDIGIT : 0);
+	    UV cp;  /* Ord of current character */
+
+	    /* Code points are separated by dots.  If none, there is only one
+	     * code point, and is terminated by the brace */
+	    endchar = strchr(RExC_parse, '.');
+	    if (! endchar) endchar = endbrace;
+
+	    /* The values are Unicode even on EBCDIC machines */
+	    this_char_length = (STRLEN)(endchar - RExC_parse);
+	    cp = grok_hex(RExC_parse, &this_char_length, &grok_flags, NULL);
+	    if ( this_char_length == 0 
+		|| this_char_length != (STRLEN)(endchar - RExC_parse) )
+	    {
+		cp = UNICODE_REPLACEMENT;   /* Substitute a valid character */
+	    }    
+
+	    if (! FOLD) {	/* Not folding, just append to the string */
+		STRLEN unilen;
+
+		/* Quit before adding this character if would exceed limit */
+		if (len + UNISKIP(cp) > U8_MAX) break;
 
+		unilen = reguni(pRExC_state, cp, s);
+		if (unilen > 0) {
+		    s   += unilen;
+		    len += unilen;
+		}
+	    } else {	/* Folding, output the folded equivalent */
+		STRLEN foldlen,numlen;
+		U8 tmpbuf[UTF8_MAXBYTES_CASE+1], *foldbuf;
+		cp = toFOLD_uni(cp, tmpbuf, &foldlen);
+
+		/* Quit before exceeding size limit */
+		if (len + foldlen > U8_MAX) break;
+		
+		for (foldbuf = tmpbuf;
+		    foldlen;
+		    foldlen -= numlen) 
+		{
+		    cp = utf8_to_uvchr(foldbuf, &numlen);
+		    if (numlen > 0) {
+			const STRLEN unilen = reguni(pRExC_state, cp, s);
+			s       += unilen;
+			len     += unilen;
+			/* In EBCDIC the numlen and unilen can differ. */
+			foldbuf += numlen;
+			if (numlen >= foldlen)
+			    break;
+		    }
+		    else
+			break; /* "Can't happen." */
+		}                          
+	    }
+
+	    /* Point to the beginning of the next character in the sequence. */
+	    RExC_parse = endchar + 1;
+
+	    /* Quit if no more characters */
+	    if (RExC_parse >= endbrace) break;
+	}
+
+
+	if (SIZE_ONLY) {
+	    if (RExC_parse < endbrace) {
+		ckWARNreg(RExC_parse - 1,
+			  "Using just the first characters returned by \\N{}");
+	    }
+
+	    RExC_size += STR_SZ(len);
+	} else {
+	    STR_LEN(ret) = len;
+	    RExC_emit += STR_SZ(len);
+	}
+
+	RExC_parse = endbrace + 1;
+
+	*flagp |= HASWIDTH; /* Not SIMPLE, as that causes the engine to fail
+			       with malformed in t/re/pat_advanced.t */
+	RExC_parse --;
+	Set_Node_Cur_Length(ret); /* MJD */
+	nextchar(pRExC_state);
+    }
+
+    return ret;
 }
 
 
@@ -8908,8 +8851,8 @@ S_regtail_study(pTHX_ RExC_state_t *pRExC_state, regnode *p, const regnode *val,
 /*
  - regcurly - a little FSA that accepts {\d+,?\d*}
  */
-STATIC I32
-S_regcurly(register const char *s)
+I32
+Perl_regcurly(register const char *s)
 {
     PERL_ARGS_ASSERT_REGCURLY;
 
diff --git a/t/lib/Cname.pm b/t/lib/Cname.pm
index d4b8a9e..562f59a 100644
--- a/t/lib/Cname.pm
+++ b/t/lib/Cname.pm
@@ -4,6 +4,7 @@ our $Evil='A';
 sub translator {
     my $str = shift;
     if ( $str eq 'EVIL' ) {
+        # Returns A first time, AB second, ABC third ... A-ZA the 27th time.
         (my $c=substr("A".$Evil,-1))++;
         my $r=$Evil;
         $Evil.=$c;
@@ -12,6 +13,25 @@ sub translator {
     if ( $str eq 'EMPTY-STR') {
        return "";
     }
+    if ( $str eq 'NULL') {
+        return "\0";
+    }
+    if ( $str eq 'LONG-STR') {
+        return 'A' x 255;
+    }
+    # Should exceed limit for regex \N bytes in a sequence.  Anyway it will if
+    # UCHAR_MAX is 255.
+    if ( $str eq 'TOO-LONG-STR') {
+       return 'A' x 256;
+    }
+    if ($str eq 'MALFORMED') {
+        $str = "\xDF\xDFabc";
+        utf8::upgrade($str);
+         
+        # Create a malformed in first and second characters.
+        $str =~ s/^\C/A/;
+        $str =~ s/^(\C\C)\C/$1A/;
+    }
     return $str;
 }
 
diff --git a/t/re/pat.t b/t/re/pat.t
index 314e52b..40ae52e 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -2,7 +2,9 @@
 #
 # This is a home for regular expression tests that don't fit into
 # the format supported by re/regexp.t.  If you want to add a test
-# that does fit that format, add it to re/re_tests, not here.
+# that does fit that format, add it to re/re_tests, not here.  Tests for \N
+# should be added here because they are treated as single quoted strings
+# there, which means they avoid the lexer which otherwise would look at them.
 
 use strict;
 use warnings;
@@ -21,7 +23,7 @@ BEGIN {
 }
 
 
-plan tests => 293;  # Update this when adding/deleting tests.
+plan tests => 297;  # Update this when adding/deleting tests.
 
 run_tests() unless caller;
 
@@ -969,6 +971,23 @@ sub run_tests {
         iseq "@space2", "spc tab";
     }
 
+    {
+        use charnames ":full";
+        local $Message = 'Delayed interpolation of \N';
+        my $r1 = qr/\N{THAI CHARACTER SARA I}/;
+        my $s1 = "\x{E34}\x{E34}\x{E34}\x{E34}";
+
+        # Bug #56444
+        ok $s1 =~ /$r1+/, 'my $r1 = qr/\N{THAI CHARACTER SARA I}/; my $s1 = "\x{E34}\x{E34}\x{E34}\x{E34}; $s1 =~ /$r1+/';
+
+        # Bug #62056
+        ok "${s1}A" =~ m/$s1\N{LATIN CAPITAL LETTER A}/, '"${s1}A" =~ m/$s1\N{LATIN CAPITAL LETTER A}/';
+
+        ok "abbbbc" =~ m/\N{1}/ && $& eq "a", '"abbbbc" =~ m/\N{1}/ && $& eq "a"';
+        ok "abbbbc" =~ m/\N{3,4}/ && $& eq "abbb", '"abbbbc" =~ m/\N{3,4}/ && $& eq "abbb"';
+    }
+
+
 } # End of sub run_tests
 
 1;
diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t
index 3a66a0c..86735ec 100644
--- a/t/re/pat_advanced.t
+++ b/t/re/pat_advanced.t
@@ -21,7 +21,7 @@ BEGIN {
 }
 
 
-plan tests => 1143;  # Update this when adding/deleting tests.
+plan tests => 1155;  # Update this when adding/deleting tests.
 
 run_tests() unless caller;
 
@@ -1024,21 +1024,20 @@ sub run_tests {
         use Cname;
 
         ok 'fooB'  =~ /\N{foo}[\N{B}\N{b}]/, "Passthrough charname";
-        my $test   = 1233;
         #
         # Why doesn't must_warn work here?
         #
         my $w;
         local $SIG {__WARN__} = sub {$w .= "@_"};
         eval 'q(xxWxx) =~ /[\N{WARN}]/';
-        ok $w && $w =~ /^Ignoring excess chars from/,
-                 "Ignoring excess chars warning";
+        ok $w && $w =~ /Using just the first character returned by \\N{} in character class/,
+                 "single character in [\\N{}] warning";
 
         undef $w;
         eval q [ok "\0" !~ /[\N{EMPTY-STR}XY]/,
                    "Zerolength charname in charclass doesn't match \\0"];
-        ok $w && $w =~ /^Ignoring zero length/,
-                 'Ignoring zero length \N{%} in character class warning';
+        ok $w && $w =~ /Ignoring zero length/,
+                 'Ignoring zero length \N{} in character class warning';
 
         ok 'AB'  =~ /(\N{EVIL})/ && $1 eq 'A', 'Charname caching $1';
         ok 'ABC' =~ /(\N{EVIL})/,              'Charname caching $1';
@@ -1046,6 +1045,26 @@ sub run_tests {
                     'Empty string charname produces NOTHING node';
         ok ''    =~ /\N{EMPTY-STR}/,
                     'Empty string charname produces NOTHING node';
+        ok "\N{LONG-STR}" =~ /^\N{LONG-STR}$/, 'Verify that long string works';
+        ok "\N{LONG-STR}" =~ /^\N{LONG-STR}$/i, 'Verify under folding that long string works';
+
+        # If remove the limitation in regcomp code these should work
+        # differently
+        undef $w;
+        eval q [ok "\N{LONG-STR}" =~ /^\N{TOO-LONG-STR}$/, 'Verify that too long a string fails gracefully'];
+        ok $w && $w =~ /Using just the first characters returned/, 'Verify that got too-long string warning in \N{} that exceeds the limit';
+        undef $w;
+        eval q [ok "\N{LONG-STR}" =~ /^\N{TOO-LONG-STR}$/i, 'Verify under folding that too long a string fails gracefully'];
+        ok $w && $w =~ /Using just the first characters returned/, 'Verify under folding that got too-long string warning in \N{} that exceeds the limit';
+        undef $w;
+        eval q [ok "\N{TOO-LONG-STR}" !~ /^\N{TOO-LONG-STR}$/, 'Verify that too long a string doesnt work'];
+        ok $w && $w =~ /Using just the first characters returned/, 'Verify that got too-long string warning in \N{} that exceeds the limit';
+        undef $w;
+        eval q [ok "\N{TOO-LONG-STR}" !~ /^\N{TOO-LONG-STR}$/i, 'Verify under folding that too long a string doesnt work'];
+        ok $w && $w =~ /Using just the first characters returned/i, 'Verify under folding that got too-long string warning in \N{} that exceeds the limit';
+        undef $w;
+        eval 'q(syntax error) =~ /\N{MALFORMED}/';
+        ok $@ && $@ =~ /Malformed/, 'Verify that malformed utf8 gives an error';
 
     }
 
@@ -1064,6 +1083,7 @@ sub run_tests {
         ok "\N{SPACE}\N{U+0041}\N{SPACE}\N{U+0042}" =~
            /[\N{SPACE}\N{U+0041}][\N{SPACE}\N{U+0042}]/,
             'Intermixed named and unicode escapes';
+        ok "\0" =~ /^\N{NULL}$/, 'Verify that \N{NULL} works; is not confused with an error';
     }
 
 
diff --git a/t/re/re_tests b/t/re/re_tests
index dc03084..6304fe6 100644
--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -1388,6 +1388,13 @@ foo(\h)bar	foo\tbar	y	$1	\t
 # [perl #60344] Regex lookbehind failure after an (if)then|else in perl 5.10
 /\A(?(?=db2)db2|\D+)(?<!processed)\.csv\z/xms	sql_processed.csv	n	-	-
 /\N{U+0100}/	\x{100}	y	$&	\x{100}	# Bug #59328
+/[a\N{U+0100}]/	\x{100}	y	$&	\x{100}
+/[a\N{U+0100}]/	a	y	$&	a
+
+# Verify that \N{U+...} forces Unicode semantics
+/\N{U+41}\x{c1}/i	a\x{e1}	y	$&	a\x{e1}
+/[\N{U+41}\x{c1}]/i	\x{e1}	y	$&	\x{e1}
+
 [\s][\S]	\x{a0}\x{a0}	nT	-	-	# Unicode complements should not match same character
 
 # was generating malformed utf8
@@ -1395,4 +1402,22 @@ foo(\h)bar	foo\tbar	y	$1	\t
 
 ((??{ "(?:|)" }))\s	C\x20 	y	-	-
 
+# Verify that \ escapes the { after \N, and causes \N to match non-newline
+abc\N\{U+BEEF}	abc\n{UBEEF}	n		
+abc\N\{U+BEEF}	abc.{UBEEF}	y	$&	abc.{UBEEF}
+[abc\N\{U+BEEF}]	-	c	-	\\N in a character class must be a named character
+
+# Verify that \N can be trailing and causes \N to match non-newline
+abc\N	abcd	y	$&	abcd
+abc\N	abc\n	n		
+
+# Verify get errors.  For these, we need // or else puts it in single quotes,
+# and doesn't expand.
+/\N{U+}/	-	c	-	Invalid hexadecimal number
+/abc\N{def/	-	c	-	Missing right brace
+
+# Verifies catches hex errors, and doesn't expose our . notation to the outside
+/\N{U+0xBEEF}/	- 	c	-	Illegal hexadecimal digit
+/\N{U+BEEF.BEAD}/	- 	c	-	Illegal hexadecimal digit
+
 # vim: set noexpandtab
diff --git a/t/re/regexp.t b/t/re/regexp.t
index 2344610..558c6f0 100644
--- a/t/re/regexp.t
+++ b/t/re/regexp.t
@@ -36,6 +36,10 @@
 # If you want to add a regular expression test that can't be expressed
 # in this format, don't add it here: put it in re/pat.t instead.
 #
+# Note that the inputs get passed on as "m're'", so the re bypasses the lexer.
+# This means this file cannot be used for testing anything that the lexer
+# handles; in 5.12 this means just \N{NAME} and \N{U+...}.
+#
 # Note that columns 2,3 and 5 are all enclosed in double quotes and then
 # evalled; so something like a\"\x{100}$1 has length 3+length($1).
 
diff --git a/toke.c b/toke.c
index 9df0ff2..fcfdd71 100644
--- a/toke.c
+++ b/toke.c
@@ -2471,10 +2471,7 @@ S_sublex_done(pTHX)
 
   In patterns:
     backslashes:
-      double-quoted style: \r and \n
-      regexp special ones: \D \s
-      constants: \x31
-      backrefs: \1
+      constants: \N{NAME} only
       case and quoting: \U \Q \E
     stops on @ and $, but not for $ as tail anchor
 
@@ -2488,7 +2485,7 @@ S_sublex_done(pTHX)
   In double-quoted strings:
     backslashes:
       double-quoted style: \r and \n
-      constants: \x31
+      constants: \x31, etc.
       deprecated backrefs: \1 (in substitution replacements)
       case and quoting: \U \Q \E
     stops on @ and $
@@ -2516,14 +2513,14 @@ S_sublex_done(pTHX)
 	  check for embedded arrays
 	  check for embedded scalars
 	  if (backslash) {
-	      leave intact backslashes from leaveit (below)
 	      deprecate \1 in substitution replacements
 	      handle string-changing backslashes \l \U \Q \E, etc.
 	      switch (what was escaped) {
 		  handle \- in a transliteration (becomes a literal -)
+		  if a pattern and not \N{, go treat as regular character
 		  handle \132 (octal characters)
 		  handle \x15 and \x{1234} (hex characters)
-		  handle \N{name} (named characters)
+		  handle \N{name} (named characters, also \N{3,5} in a pattern)
 		  handle \cV (control characters)
 		  handle printf-style backslashes (\f, \r, \n, etc)
 	      } (end switch)
@@ -2581,6 +2578,7 @@ S_scan_const(pTHX_ char *start)
 
 
     while (s < send || dorange) {
+
         /* get transliterations out of the way (they're most literal) */
 	if (PL_lex_inwhat == OP_TRANS) {
 	    /* expand a range A-Z to the full set of characters.  AIE! */
@@ -2800,6 +2798,8 @@ S_scan_const(pTHX_ char *start)
 
 	/* backslashes */
 	if (*s == '\\' && s+1 < send) {
+	    char* e;	/* Can be used for ending '}', etc. */
+
 	    s++;
 
 	    /* deprecate \1 in strings and substitution replacements */
@@ -2816,13 +2816,28 @@ S_scan_const(pTHX_ char *start)
 		--s;
 		break;
 	    }
-	    /* skip any other backslash escapes in a pattern */
-	    else if (PL_lex_inpat) {
+	    /* In a pattern, process \N, but skip any other backslash escapes.
+	     * This is because we don't want to translate an escape sequence
+	     * into a meta symbol and have the regex compiler use the meta
+	     * symbol meaning, e.g. \x{2E} would be confused with a dot.  But
+	     * in spite of this, we do have to process \N here while the proper
+	     * charnames handler is in scope.  See bugs #56444 and #62056.
+	     * There is a complication because \N in a pattern may also stand
+	     * for 'match a non-nl', and not mean a charname, in which case its
+	     * processing should be deferred to the regex compiler.  To be a
+	     * charname it must be followed immediately by a '{', and not look
+	     * like \N followed by a curly quantifier, i.e., not something like
+	     * \N{3,}.  regcurly returns a boolean indicating if it is a legal
+	     * quantifier */
+	    else if (PL_lex_inpat
+		    && (*s != 'N'
+			|| s[1] != '{'
+			|| regcurly(s + 1)))
+	    {
 		*d++ = NATIVE_TO_NEED(has_utf8,'\\');
 		goto default_action;
 	    }
 
-	    /* if we get here, it's either a quoted -, or a digit */
 	    switch (*s) {
 
 	    /* quoted - in transliterations */
@@ -2881,15 +2896,13 @@ S_scan_const(pTHX_ char *start)
 		}
 
 	      NUM_ESCAPE_INSERT:
-		/* Insert oct, hex, or \N{U+...} escaped character.  There will
-		 * always be enough room in sv since such escapes will be
-		 * longer than any UTF-8 sequence they can end up as, except if
-		 * they force us to recode the rest of the string into utf8 */
+		/* Insert oct or hex escaped character.  There will always be
+		 * enough room in sv since such escapes will be longer than any
+		 * UTF-8 sequence they can end up as, except if they force us
+		 * to recode the rest of the string into utf8 */
 		
 		/* Here uv is the ordinal of the next character being added in
-		 * unicode (converted from native).  (It has to be done before
-		 * here because \N is interpreted as unicode, and oct and hex
-		 * as native.) */
+		 * unicode (converted from native). */
 		if (!UNI_IS_INVARIANT(uv)) {
 		    if (!has_utf8 && uv > 255) {
 			/* Might need to recode whatever we have accumulated so
@@ -2929,92 +2942,289 @@ S_scan_const(pTHX_ char *start)
 		}
 		continue;
 
-	    /* \N{LATIN SMALL LETTER A} is a named character, and so is
-	     * \N{U+0041} */
  	    case 'N':
- 		++s;
- 		if (*s == '{') {
- 		    char* e = strchr(s, '}');
- 		    SV *res;
- 		    STRLEN len;
- 		    const char *str;
-
- 		    if (!e) {
+		/* In a non-pattern \N must be a named character, like \N{LATIN
+		 * SMALL LETTER A} or \N{U+0041}.  For patterns, it also can
+		 * mean to match a non-newline.  For non-patterns, named
+		 * characters are converted to their string equivalents. In
+		 * patterns, named characters are not converted to their
+		 * ultimate forms for the same reasons that other escapes
+		 * aren't.  Instead, they are converted to the \N{U+...} form
+		 * to get the value from the charnames that is in effect right
+		 * now, while preserving the fact that it was a named character
+		 * so that the regex compiler knows this */
+
+		/* This section of code doesn't generally use the
+		 * NATIVE_TO_NEED() macro to transform the input.  I (khw) did
+		 * a close examination of this macro and determined it is a
+		 * no-op except on utfebcdic variant characters.  Every
+		 * character generated by this that would normally need to be
+		 * enclosed by this macro is invariant, so the macro is not
+		 * needed, and would complicate use of copy(). There are other
+		 * parts of this file where the macro is used inconsistently,
+		 * but are saved by it being a no-op */
+
+		/* The structure of this section of code (besides checking for
+		 * errors and upgrading to utf8) is:
+		 *  Further disambiguate between the two meanings of \N, and if
+		 *	not a charname, go process it elsewhere
+		 *  If of form \N{U+...}, pass it through if a pattern; otherwise
+		 *	convert to utf8
+		 *  Otherwise must be \N{NAME}: convert to \N{U+c1.c2...} if a pattern;
+		 *	otherwise convert to utf8 */
+
+		/* Here, s points to the 'N'; the test below is guaranteed to
+		 * succeed if we are being called on a pattern as we already
+		 * know from a test above that the next character is a '{'.
+		 * On a non-pattern \N must mean 'named sequence, which
+		 * requires braces */
+		s++;
+		if (*s != '{') {
+		    yyerror("Missing braces on \\N{}"); 
+		    continue;
+		}
+		s++;
+
+		/* If there is no matching '}', it is an error outside of a
+		 * pattern, or ambiguous inside. */
+		if (! (e = strchr(s, '}'))) {
+		    if (! PL_lex_inpat) {
 			yyerror("Missing right brace on \\N{}");
-			e = s - 1;
-			goto cont_scan;
+			continue;
 		    }
-		    if (e > s + 2 && s[1] == 'U' && s[2] == '+') {
-			/* \N{U+...} The ... is a unicode value even on EBCDIC
-			 * machines */
-		        I32 flags = PERL_SCAN_ALLOW_UNDERSCORES |
-			  PERL_SCAN_DISALLOW_PREFIX;
-		        s += 3;
-			len = e - s;
-			uv = grok_hex(s, &len, &flags, NULL);
-			if ( e > s && len != (STRLEN)(e - s) ) {
-			    uv = 0xFFFD;
+		    else {
+
+			/* A missing brace means it can't be a legal character
+			 * name, and it could be a legal "match non-newline".
+			 * But it's kind of weird without an unescaped left
+			 * brace, so warn. */
+			if (ckWARN(WARN_SYNTAX)) {
+			    Perl_warner(aTHX_ packWARN(WARN_SYNTAX),
+				    "Missing right brace on \\N{} or unescaped left brace after \\N.  Assuming the latter");
 			}
-			s = e + 1;
-			goto NUM_ESCAPE_INSERT;
+			s -= 3; /* Backup over cur char, {, N, to the '\' */
+			*d++ = NATIVE_TO_NEED(has_utf8,'\\');
+			goto default_action;
 		    }
-		    res = newSVpvn(s + 1, e - s - 1);
-		    res = new_constant( NULL, 0, "charnames",
-					res, NULL, s - 2, e - s + 3 );
-		    if (has_utf8)
-			sv_utf8_upgrade(res);
-		    str = SvPV_const(res,len);
-#ifdef EBCDIC_NEVER_MIND
-		    /* charnames uses pack U and that has been
-		     * recently changed to do the below uni->native
-		     * mapping, so this would be redundant (and wrong,
-		     * the code point would be doubly converted).
-		     * But leave this in just in case the pack U change
-		     * gets revoked, but the semantics is still
-		     * desireable for charnames. --jhi */
-		    {
-			 UV uv = utf8_to_uvchr((const U8*)str, 0);
+		}
 
-			 if (uv < 0x100) {
-			      U8 tmpbuf[UTF8_MAXBYTES+1], *d;
+		/* Here it looks like a named character */
 
-			      d = uvchr_to_utf8(tmpbuf, UNI_TO_NATIVE(uv));
-			      sv_setpvn(res, (char *)tmpbuf, d - tmpbuf);
-			      str = SvPV_const(res, len);
-			 }
-		    }
-#endif
-		    /* If destination is not in utf8 but this new character is,
-		     * recode the dest to utf8 */
-		    if (!has_utf8 && SvUTF8(res)) {
+		if (PL_lex_inpat) {
+
+		    /* XXX This block is temporary code.  \N{} implies that the
+		     * pattern is to have Unicode semantics, and therefore
+		     * currently has to be encoded in utf8.  By putting it in
+		     * utf8 now, we save a whole pass in the regular expression
+		     * compiler.  Once that code is changed so Unicode
+		     * semantics doesn't necessarily have to be in utf8, this
+		     * block should be removed */
+		    if (!has_utf8) {
 			SvCUR_set(sv, d - SvPVX_const(sv));
 			SvPOK_on(sv);
 			*d = '\0';
 			/* See Note on sizing above.  */
 			sv_utf8_upgrade_flags_grow(sv,
-					    SV_GMAGIC|SV_FORCE_UTF8_UPGRADE,
-					    len + (STRLEN)(send - s) + 1);
+					SV_GMAGIC|SV_FORCE_UTF8_UPGRADE,
+					/* 5 = '\N{' + cur char + NUL */
+					(STRLEN)(send - s) + 5);
 			d = SvPVX(sv) + SvCUR(sv);
 			has_utf8 = TRUE;
-		    } else if (len > (STRLEN)(e - s + 4)) { /* I _guess_ 4 is \N{} --jhi */
+		    }
+		}
 
-			/* See Note on sizing above.  (NOTE: SvCUR() is not set
-			 * correctly here). */
-			const STRLEN off = d - SvPVX_const(sv);
-			d = SvGROW(sv, off + len + (STRLEN)(send - s) + 1) + off;
+		if (*s == 'U' && s[1] == '+') { /* \N{U+...} */
+		    I32 flags = PERL_SCAN_ALLOW_UNDERSCORES
+				| PERL_SCAN_DISALLOW_PREFIX;
+		    STRLEN len;
+
+		    /* For \N{U+...}, the '...' is a unicode value even on
+		     * EBCDIC machines */
+		    s += 2;	    /* Skip to next char after the 'U+' */
+		    len = e - s;
+		    uv = grok_hex(s, &len, &flags, NULL);
+		    if (len == 0 || len != (STRLEN)(e - s)) {
+			yyerror("Invalid hexadecimal number in \\N{U+...}");
+			s = e + 1;
+			continue;
+		    }
+
+		    if (PL_lex_inpat) {
+
+			/* Pass through to the regex compiler unchanged.  The
+			 * reason we evaluated the number above is to make sure
+			 * there wasn't a syntax error.  It also makes sure
+			 * that the syntax created below, \N{Uc1.c2...}, is
+			 * internal-only */
+			s -= 5;	    /* Include the '\N{U+' */
+			Copy(s, d, e - s + 1, char);	/* 1 = include the } */
+			d += e - s + 1;
+		    }
+		    else {  /* Not a pattern: convert the hex to string */
+
+			 /* If destination is not in utf8, unconditionally
+			  * recode it to be so.  This is because \N{} implies
+			  * Unicode semantics, and scalars have to be in utf8
+			  * to guarantee those semantics */
+			if (! has_utf8) {
+			    SvCUR_set(sv, d - SvPVX_const(sv));
+			    SvPOK_on(sv);
+			    *d = '\0';
+			    /* See Note on sizing above.  */
+			    sv_utf8_upgrade_flags_grow(
+					sv,
+					SV_GMAGIC|SV_FORCE_UTF8_UPGRADE,
+					UNISKIP(uv) + (STRLEN)(send - e) + 1);
+			    d = SvPVX(sv) + SvCUR(sv);
+			    has_utf8 = TRUE;
+			}
+
+			/* Add the string to the output */
+			if (UNI_IS_INVARIANT(uv)) {
+			    *d++ = (char) uv;
+			}
+			else d = (char*)uvuni_to_utf8((U8*)d, uv);
+		    }
+		}
+		else { /* Here is \N{NAME} but not \N{U+...}. */
+
+		    SV *res;		/* result from charnames */
+		    const char *str;    /* the string in 'res' */
+		    STRLEN len;		/* its length */
+
+		    /* Get the value for NAME */
+		    res = newSVpvn(s, e - s);
+		    res = new_constant( NULL, 0, "charnames",
+					/* includes all of: \N{...} */
+					res, NULL, s - 3, e - s + 4 );
+
+		    /* Most likely res will be in utf8 already since the
+		     * standard charnames uses pack U, but a custom translator
+		     * can leave it otherwise, so make sure.  XXX This can be
+		     * revisited to not have charnames use utf8 for characters
+		     * that don't need it when regexes don't have to be in utf8
+		     * for Unicode semantics.  If doing so, remember EBCDIC */
+		    sv_utf8_upgrade(res);
+		    str = SvPV_const(res, len);
+
+		    /* Don't accept malformed input */
+		    if (! is_utf8_string((U8 *) str, len)) {
+			yyerror("Malformed UTF-8 returned by \\N");
+		    }
+		    else if (PL_lex_inpat) {
+
+			if (! len) { /* The name resolved to an empty string */
+			    Copy("\\N{}", d, 4, char);
+			    d += 4;
+			}
+			else {
+			    /* In order to not lose information for the regex
+			    * compiler, pass the result in the specially made
+			    * syntax: \N{U+c1.c2.c3...}, where c1 etc. are
+			    * the code points in hex of each character
+			    * returned by charnames */
+
+			    const char *str_end = str + len;
+			    STRLEN char_length;	    /* cur char's byte length */
+			    STRLEN output_length;   /* and the number of bytes
+						       after this is translated
+						       into hex digits */
+			    const STRLEN off = d - SvPVX_const(sv);
+
+			    /* 2 hex per byte; 2 chars for '\N'; 2 chars for
+			     * max('U+', '.'); and 1 for NUL */
+			    char hex_string[2 * UTF8_MAXBYTES + 5];
+
+			    /* Get the first character of the result. */
+			    U32 uv = utf8n_to_uvuni((U8 *) str,
+						    len,
+						    &char_length,
+						    UTF8_ALLOW_ANYUV);
+
+			    /* The call to is_utf8_string() above hopefully
+			     * guarantees that there won't be an error.  But
+			     * it's easy here to make sure.  The function just
+			     * above warns and returns 0 if invalid utf8, but
+			     * it can also return 0 if the input is validly a
+			     * NUL. Disambiguate */
+			    if (uv == 0 && NATIVE_TO_ASCII(*str) != '\0') {
+				uv = UNICODE_REPLACEMENT;
+			    }
+
+			    /* Convert first code point to hex, including the
+			     * boiler plate before it */
+			    sprintf(hex_string, "\\N{U+%X", (unsigned int) uv);
+			    output_length = strlen(hex_string);
+
+			    /* Make sure there is enough space to hold it */
+			    d = off + SvGROW(sv, off
+						 + output_length
+						 + (STRLEN)(send - e)
+						 + 2);	/* '}' + NUL */
+			    /* And output it */
+			    Copy(hex_string, d, output_length, char);
+			    d += output_length;
+
+			    /* For each subsequent character, append dot and
+			     * its ordinal in hex */
+			    while ((str += char_length) < str_end) {
+				const STRLEN off = d - SvPVX_const(sv);
+				U32 uv = utf8n_to_uvuni((U8 *) str,
+							str_end - str,
+							&char_length,
+							UTF8_ALLOW_ANYUV);
+				if (uv == 0 && NATIVE_TO_ASCII(*str) != '\0') {
+				    uv = UNICODE_REPLACEMENT;
+				}
+
+				sprintf(hex_string, ".%X", (unsigned int) uv);
+				output_length = strlen(hex_string);
+
+				d = off + SvGROW(sv, off
+						     + output_length
+						     + (STRLEN)(send - e)
+						     + 2);	/* '}' +  NUL */
+				Copy(hex_string, d, output_length, char);
+				d += output_length;
+			    }
+
+			    *d++ = '}';	/* Done.  Add the trailing brace */
+			}
+		    }
+		    else { /* Here, not in a pattern.  Convert the name to a
+			    * string. */
+
+			 /* If destination is not in utf8, unconditionally
+			  * recode it to be so.  This is because \N{} implies
+			  * Unicode semantics, and scalars have to be in utf8
+			  * to guarantee those semantics */
+			if (! has_utf8) {
+			    SvCUR_set(sv, d - SvPVX_const(sv));
+			    SvPOK_on(sv);
+			    *d = '\0';
+			    /* See Note on sizing above.  */
+			    sv_utf8_upgrade_flags_grow(sv,
+						SV_GMAGIC|SV_FORCE_UTF8_UPGRADE,
+						len + (STRLEN)(send - s) + 1);
+			    d = SvPVX(sv) + SvCUR(sv);
+			    has_utf8 = TRUE;
+			} else if (len > (STRLEN)(e - s + 4)) { /* I _guess_ 4 is \N{} --jhi */
+
+			    /* See Note on sizing above.  (NOTE: SvCUR() is not
+			     * set correctly here). */
+			    const STRLEN off = d - SvPVX_const(sv);
+			    d = off + SvGROW(sv, off + len + (STRLEN)(send - s) + 1);
+			}
+			Copy(str, d, len, char);
+			d += len;
 		    }
-#ifdef EBCDIC
-		    if (!dorange)
-			native_range = FALSE; /* \N{} is guessed to be Unicode */
-#endif
-		    Copy(str, d, len, char);
-		    d += len;
 		    SvREFCNT_dec(res);
-		  cont_scan:
-		    s = e + 1;
 		}
-		else
-		    yyerror("Missing braces on \\N{}");
+#ifdef EBCDIC
+		if (!dorange) 
+		    native_range = FALSE; /* \N{} is defined to be Unicode */
+#endif
+		s = e + 1;  /* Point to just after the '}' */
 		continue;
 
 	    /* \c is a control character */
@@ -11308,6 +11518,10 @@ S_new_constant(pTHX_ const char *s, STRLEN len, const char *key, STRLEN keylen,
  	SvREFCNT_dec(msg);
   	return sv;
     }
+
+    /* charnames doesn't work well if there have been errors found */
+    if (PL_error_count > 0 && strEQ(key,"charnames")) return res;
+
     cvp = hv_fetch(table, key, keylen, FALSE);
     if (!cvp || !SvOK(*cvp)) {
 	why1 = "$^H{";
-- 
1.5.6.3
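
In a pattern, the patch above rewrites `\N{NAME}` as `\N{U+c1.c2...}`, with one hex code point per character returned by the charnames handler, and as `\N{}` when the name resolves to the empty string. The following is a minimal standalone C sketch of that formatting step only. It assumes the code points have already been decoded from UTF-8. `encode_named_seq` and its fixed-size buffer are hypothetical and are not the code in toke.c, which grows the SV with SvGROW as it goes.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper, not part of toke.c: write the code points in cps[]
 * as "\N{U+c1.c2...}" into buf.  An empty sequence becomes "\N{}", which is
 * what the patch emits when a name resolves to the empty string.
 * Returns the number of characters written (excluding the NUL), or 0 if
 * buf is too small. */
static size_t encode_named_seq(const unsigned long *cps, size_t n,
                               char *buf, size_t bufsize)
{
    size_t used;
    size_t i;
    int w;

    if (n == 0) {
        w = snprintf(buf, bufsize, "\\N{}");
        return (w > 0 && (size_t)w < bufsize) ? (size_t)w : 0;
    }

    /* First code point, including the boilerplate in front of it */
    w = snprintf(buf, bufsize, "\\N{U+%lX", cps[0]);
    if (w < 0 || (size_t)w >= bufsize)
        return 0;
    used = (size_t)w;

    /* Each subsequent code point: a dot, then its ordinal in hex */
    for (i = 1; i < n; i++) {
        w = snprintf(buf + used, bufsize - used, ".%lX", cps[i]);
        if (w < 0 || (size_t)w >= bufsize - used)
            return 0;
        used += (size_t)w;
    }

    if (used + 2 > bufsize)     /* room for '}' and NUL */
        return 0;
    buf[used++] = '}';
    buf[used] = '\0';
    return used;
}
```

Called with the single code point 0xE34 (THAI CHARACTER SARA I, as in the new test), the sketch produces `\N{U+E34}`; a two-character sequence such as 0x41, 0x301 produces `\N{U+41.301}`.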

@p5pRT
Author

p5pRT commented Feb 18, 2010

From @khwilliamson

0002-Make-a-missing-right-brace-on-N-fatal.patch
From 898fa65e47cbfa86d84ae4fa71995f211339f3e5 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@khw-desktop.(none)>
Date: Thu, 18 Feb 2010 15:06:51 -0700
Subject: [PATCH] Make a missing right brace on \N{ fatal

It was decided that this should be a fatal error instead of a warning.

Also some comments were updated..
---
 pod/perldiag.pod |   30 +++++++++++++++---------------
 toke.c           |   33 +++++++++------------------------
 2 files changed, 24 insertions(+), 39 deletions(-)

diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index 486a515..4a12889 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -2512,32 +2512,32 @@ can vary from one line to the next.
 
 (F) Missing right brace in C<\x{...}>, C<\p{...}>, C<\P{...}>, or C<\N{...}>.
 
-=item Missing right brace on \\N{} or unescaped left brace after \\N.  Assuming the latter
+=item Missing right brace on \\N{} or unescaped left brace after \\N
 
-(W syntax)
-C<\N> has traditionally been followed by a name enclosed in braces,
-meaning the character (or sequence of characters) given by that name.
+(F)
+C<\N> has two meanings.
+
+The traditional one has it followed by a name enclosed
+in braces, meaning the character (or sequence of characters) given by that name.
 Thus C<\N{ASTERISK}> is another way of writing C<*>, valid in both
-double-quoted strings and regular expression patterns.
-In patterns, it doesn't have the meaning an unescaped C<*> does.
+double-quoted strings and regular expression patterns.  In patterns, it doesn't
+have the meaning an unescaped C<*> does.
 
-Starting in Perl 5.12.0, C<\N> also can have an additional meaning in patterns,
-namely to match a non-newline character.  (This is like C<.> but is not
-affected by the C</s> modifier.)
+Starting in Perl 5.12.0, C<\N> also can have an additional meaning (only) in
+patterns, namely to match a non-newline character.  (This is like C<.> but is
+not affected by the C</s> modifier.)
 
 This can lead to some ambiguities.  When C<\N> is not followed immediately by a
 left brace, Perl assumes the "match non-newline character" meaning.  Also, if
 the braces form a valid quantifier such as C<\N{3}> or C<\N{5,}>, Perl assumes
 that this means to match the given quantity of non-newlines (in these examples,
-3, and 5 or more, respectively).  In all other case, where there is a C<\N{>
+3; and 5 or more, respectively).  In all other case, where there is a C<\N{>
 and a matching C<}>, Perl assumes that a character name is desired.
 
 However, if there is no matching C<}>, Perl doesn't know if it was mistakenly
-omitted, or if "match non-newline" followed by "match a C<{>" was desired.
-It assumes the latter because that is actually a valid interpretation as
-written, unlike the other case.  If you meant the former, you need to add the
-matching right brace.  If you did mean the latter, you can silence this warning
-by writing instead C<\N\{>.
+omitted, or if "match non-newline" followed by "match a C<{>" was desired, and
+raises this error.  If you meant the former, add the right brace; if you meant
+the latter, escape the brace with a backslash, like so: C<\N\{>
 
 =item Missing right curly or square bracket
 
diff --git a/toke.c b/toke.c
index fcfdd71..997b46a 100644
--- a/toke.c
+++ b/toke.c
@@ -2968,10 +2968,10 @@ S_scan_const(pTHX_ char *start)
 		 * errors and upgrading to utf8) is:
 		 *  Further disambiguate between the two meanings of \N, and if
 		 *	not a charname, go process it elsewhere
-		 *  If of form \N{U+...}, pass it through if a pattern; otherwise
-		 *	convert to utf8
-		 *  Otherwise must be \N{NAME}: convert to \N{U+c1.c2...} if a pattern;
-		 *	otherwise convert to utf8 */
+		 *  If of form \N{U+...}, pass it through if a pattern;
+		 *	otherwise convert to utf8
+		 *  Otherwise must be \N{NAME}: convert to \N{U+c1.c2...} if a
+		 *  pattern; otherwise convert to utf8 */
 
 		/* Here, s points to the 'N'; the test below is guaranteed to
 		 * succeed if we are being called on a pattern as we already
@@ -2985,27 +2985,14 @@ S_scan_const(pTHX_ char *start)
 		}
 		s++;
 
-		/* If there is no matching '}', it is an error outside of a
-		 * pattern, or ambiguous inside. */
+		/* If there is no matching '}', it is an error. */
 		if (! (e = strchr(s, '}'))) {
 		    if (! PL_lex_inpat) {
 			yyerror("Missing right brace on \\N{}");
-			continue;
-		    }
-		    else {
-
-			/* A missing brace means it can't be a legal character
-			 * name, and it could be a legal "match non-newline".
-			 * But it's kind of weird without an unescaped left
-			 * brace, so warn. */
-			if (ckWARN(WARN_SYNTAX)) {
-			    Perl_warner(aTHX_ packWARN(WARN_SYNTAX),
-				    "Missing right brace on \\N{} or unescaped left brace after \\N.  Assuming the latter");
-			}
-			s -= 3; /* Backup over cur char, {, N, to the '\' */
-			*d++ = NATIVE_TO_NEED(has_utf8,'\\');
-			goto default_action;
+		    } else {
+			yyerror("Missing right brace on \\N{} or unescaped left brace after \\N.");
 		    }
+		    continue;
 		}
 
 		/* Here it looks like a named character */
@@ -3053,9 +3040,7 @@ S_scan_const(pTHX_ char *start)
 
 			/* Pass through to the regex compiler unchanged.  The
 			 * reason we evaluated the number above is to make sure
-			 * there wasn't a syntax error.  It also makes sure
-			 * that the syntax created below, \N{Uc1.c2...}, is
-			 * internal-only */
+			 * there wasn't a syntax error. */
 			s -= 5;	    /* Include the '\N{U+' */
 			Copy(s, d, e - s + 1, char);	/* 1 = include the } */
 			d += e - s + 1;
-- 
1.5.6.3
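
The perldiag text in this patch gives the rules for deciding what `\N` means inside a pattern. A bare `\N`, or `\N` followed by a valid quantifier, matches non-newlines. `\N{` with a matching `}` is a character name. `\N{` with no `}` is now a fatal error. The C sketch below restates those rules under simplifying assumptions. `looks_like_quantifier` is a hypothetical stand-in for the quantifier test and is not the real `regcurly`, and `classify_backslash_N` is likewise an illustrative name, not a function in toke.c.

```c
#include <ctype.h>
#include <string.h>

enum n_escape_kind {
    N_NON_NEWLINE,    /* bare \N, or \N followed by a quantifier such as {3,} */
    N_CHARNAME,       /* \N{...} with a matching right brace */
    N_MISSING_BRACE   /* \N{ with no '}': fatal after this patch */
};

/* Simplified stand-in for the quantifier check (NOT the real regcurly()).
 * Accepts {n}, {n,} and {n,m}, where n and m are runs of decimal digits. */
static int looks_like_quantifier(const char *s)
{
    if (*s++ != '{')
        return 0;
    if (!isdigit((unsigned char)*s))
        return 0;
    while (isdigit((unsigned char)*s))
        s++;
    if (*s == ',') {
        s++;
        while (isdigit((unsigned char)*s))
            s++;
    }
    return *s == '}';
}

/* 'after' points at the pattern text immediately following "\N". */
static enum n_escape_kind classify_backslash_N(const char *after)
{
    if (*after != '{')
        return N_NON_NEWLINE;       /* includes the escaped form \N\{ */
    if (looks_like_quantifier(after))
        return N_NON_NEWLINE;       /* e.g. \N{3} or \N{5,} */
    if (strchr(after, '}') == NULL)
        return N_MISSING_BRACE;     /* e.g. abc\N{def */
    return N_CHARNAME;              /* e.g. \N{U+41} or \N{SPACE} */
}
```

The sketch covers patterns only. Outside a pattern, `\N` has no "match non-newline" meaning, so a missing left brace is an error there too.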

@p5pRT
Copy link
Author

p5pRT commented Feb 19, 2010

From @davidnicol

On Wed, Feb 17, 2010 at 8:46 PM, Eric Brine <ikegami@adaelis.com> wrote:

How does $x="\\N"; /$x{...}/ fit in there? [^\n] or "Charnames and you're
silly not to escape your brackets"?

An excellent question. Hopefully, in the future, when regex
interpolation is no longer done by stringify-and-reparse,

  $s=qq'\\N'; m/$s{...}/ or next; # will be compiled to regex in the second expression

while

  $r=qr'\N'; m/$r{...}/ or next; # will be compiled to regex in the first expression, making the curlies indicate counting

regardless of what \N means. If it means (?-s:.) or [^\n] or something else.

@p5pRT
Copy link
Author

p5pRT commented Feb 19, 2010

From @khwilliamson

David Nicol wrote:

On Wed, Feb 17, 2010 at 8:46 PM, Eric Brine <ikegami@adaelis.com> wrote:

How does $x="\\N"; /$x{...}/ fit in there? [^\n] or "Charnames and you're
silly not to escape your brackets"?

An excellent question. Hopefully, in the future, when regex
interpolation is no longer done by stringify-and-reparse,

$s=qq'\\N'; m/$s{...}/ or next; # will be compiled to regex in the second expression

while

$r=qr'\N'; m/$r{...}/ or next; # will be compiled to regex in the first expression, making the curlies indicate counting

regardless of what \N means. If it means (?-s:.) or [^\n] or something else.

Indeed it is an excellent question; and in the present, I'm curious what
does happen on 5.8 with this. Could someone run these on that release
(compiled with debugging) and report back the results?

  perl -w -Dr -e 'my $x="\\N"; qr/${x}{U+41}/'
  perl -w -Dr -e 'use charnames ":full";my $x="\\N"; qr/${x}{SPACE}/'

@p5pRT
Copy link
Author

p5pRT commented Feb 19, 2010

From @Abigail

On Thu, Feb 18, 2010 at 07:01:57PM -0600, David Nicol wrote:

On Wed, Feb 17, 2010 at 8:46 PM, Eric Brine <ikegami@adaelis.com> wrote:

How does $x="\\N"; /$x{...}/ fit in there? [^\n] or "Charnames and you're
silly not to escape your brackets"?

Neither. It will do a hash lookup in %x.

You'd have to write it as /${x}{...}/.

An excellent question. Hopefully, in the future, when regex
interpolation is no longer done by stringify-and-reparse,

$s=qq'\\N'; m/$s{...}/ or next; # will be compiled to regex in the second expression

while

$r=qr'\N'; m/$r{...}/ or next; # will be compiled to regex in the first expression, making the curlies indicate counting

regardless of what \N means. If it means (?-s:.) or [^\n] or something else.

If you are constructing regexes by pasting together strings, there are
many pitfalls, not just \N. Consider:

  $s = 1; /(.)\1$s/

This was one of the reasons we got \g. Which would break the following
code:

  $s = '\\g'; /${s}{foo}/

But that wasn't a reason not to have progress.

And we don't have to wait for interpolation to go beyond stringify-and-reparse
for

  $r = qr /\N/; m /${r}{...}/

to be unambiguous as $r will stringify to (?-xism:\N), and there will be
no \N{...} seen by the regexp compiler.

Abigail
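
The point about stringification can be shown with plain string pasting. A raw string holding `\N` placed in front of `{...}` yields the text `\N{`, whereas a compiled pattern is first wrapped in a group, so the brace ends up after the closing parenthesis. The C sketch below only simulates that text manipulation. `stringify_qr` is a hypothetical helper that hard-codes the `(?-xism:...)` wrapper quoted in the message above, and no regex engine is involved.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: mimic the "(?-xism:...)" wrapper that the message
 * above says a qr// object stringifies to.  Returns 1 on success, 0 if
 * 'out' is too small. */
static int stringify_qr(const char *pattern, char *out, size_t outsize)
{
    int w = snprintf(out, outsize, "(?-xism:%s)", pattern);
    return w >= 0 && (size_t)w < outsize;
}

/* Returns 1 if the pasted-together pattern text contains "\N{", i.e. the
 * sequence that could be read as the start of a character name. */
static int has_backslash_N_brace(const char *text)
{
    return strstr(text, "\\N{") != NULL;
}
```

Pasting `{3}` after the raw string `\N` gives `\N{3}`, which contains `\N{`. Pasting it after the wrapped form gives `(?-xism:\N){3}`, which does not.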

@p5pRT
Copy link
Author

p5pRT commented Feb 19, 2010

From @rgarcia

On 18 February 2010 23:14, karl williamson <public@khwilliamson.com> wrote:

Attached are two patches.  The first is essentially a rebase of the
previously submitted patch to the latest blead, plus some additional tests.
 This means it takes into account Rafael's patch to cause charnames not to
be called if there is an error (which prevents it from working properly),
but since all that parsing has been moved to toke.c, it does it differently.

The second patch changes to fatal the warning when there is a '\N{' without
a matching right brace.

I notice this :

+ /* charnames doesn't work well if there have been errors found */
+ if (PL_error_count > 0 && strEQ(key,"charnames")) return res;

Here, res is not initialized and we'll return a random value.
Also, I suppose this is supposed to replace the chunk in regcomp.c
that I added in
http://perl5.git.perl.org/perl.git/commitdiff/78c4a74a09b8f7ed410a879bd78dfb83cbf7861c
but what happens in that case with your patch:

$ ./perl -Ilib -Mcharnames=:full -Mstrict -e '$x=qr/\N{LATIN CAPITAL LETTER A}/'
perl: sv.c:3226: Perl_sv_utf8_upgrade_flags_grow: Assertion `sv' failed.
Aborted

@p5pRT
Copy link
Author

p5pRT commented Feb 19, 2010

From @rgarcia

On 19 February 2010 10:39, Rafael Garcia-Suarez <rgs@consttype.org> wrote:

On 18 February 2010 23:14, karl williamson <public@khwilliamson.com> wrote:

Attached are two patches.  The first is essentially a rebase of the
previously submitted patch to the latest blead, plus some additional tests.
 This means it takes into account Rafael's patch to cause charnames not to
be called if there is an error (which prevents it from working properly),
but since all that parsing has been moved to toke.c, it does it differently.

The second patch changes to fatal the warning when there is a '\N{' without
a matching right brace.

I notice this :

+    /* charnames doesn't work well if there have been errors found */
+    if (PL_error_count > 0 && strEQ(key,"charnames")) return res;

Here, res is not initialized and we'll return a random value.
Also, I suppose this is supposed to replace the chunk in regcomp.c
that I added in
http://perl5.git.perl.org/perl.git/commitdiff/78c4a74a09b8f7ed410a879bd78dfb83cbf7861c
but what happens in that case with your patch:

$ ./perl -Ilib -Mcharnames=:full -Mstrict -e '$x=qr/\N{LATIN CAPITAL LETTER A}/'
perl: sv.c:3226: Perl_sv_utf8_upgrade_flags_grow: Assertion `sv' failed.
Aborted

The fix was trivial, see
http://perl5.git.perl.org/perl.git/commitdiff/f5a573297aad004c6761b844d65a3e6d8402cd50

Thanks, both patches applied to bleadperl. Just in time for the next
dev release!

@p5pRT

p5pRT commented Feb 19, 2010

From @nwc10

On Fri, Feb 19, 2010 at 11:30:04AM +0100, Rafael Garcia-Suarez wrote:

The fix was trivial, see
http://perl5.git.perl.org/perl.git/commitdiff/f5a573297aad004c6761b844d65a3e6d8402cd50

Thanks, both patches applied to bleadperl. Just in time for the next
dev release!

Does regcurly really need to be part of the public API?
ie instead of 'P', can it be 'p', non-static, but private.

Nicholas Clark

@p5pRT

p5pRT commented Feb 19, 2010

From @nwc10

On Fri, Feb 19, 2010 at 10:35:19AM +0000, Nicholas Clark wrote:

On Fri, Feb 19, 2010 at 11:30:04AM +0100, Rafael Garcia-Suarez wrote:

The fix was trivial, see
http://perl5.git.perl.org/perl.git/commitdiff/f5a573297aad004c6761b844d65a3e6d8402cd50

Thanks, both patches applied to bleadperl. Just in time for the next
dev release!

Does regcurly really need to be part of the public API?
ie instead of 'P', can it be 'p', non-static, but private.

Rafael pointed out on IRC that I'm confusing 'A' and 'P'.
It's not 'A'.*

The confusion arose because I mis-interpreted
"make public the function regcurly()"

Nicholas Clark

@p5pRT

p5pRT commented Feb 19, 2010

From @rgs

Fixed by change ff3f963

@p5pRT

p5pRT commented Feb 19, 2010

@rgs - Status changed from 'open' to 'resolved'

@p5pRT p5pRT closed this as completed Feb 19, 2010
@p5pRT

p5pRT commented Feb 19, 2010

From @khwilliamson

Nicholas Clark wrote:

On Fri, Feb 19, 2010 at 10:35:19AM +0000, Nicholas Clark wrote:

On Fri, Feb 19, 2010 at 11:30:04AM +0100, Rafael Garcia-Suarez wrote:

The fix was trivial, see
http://perl5.git.perl.org/perl.git/commitdiff/f5a573297aad004c6761b844d65a3e6d8402cd50

Thanks, both patches applied to bleadperl. Just in time for the next
dev release!
Does regcurly really need to be part of the public API?
ie instead of 'P', can it be 'p', non-static, but private.

Rafael pointed out on IRC that I'm confusing 'A' and 'P'.
It's not 'A'.*

The confusion arose because I mis-interpreted
"make public the function regcurly()"

Nicholas Clark

regcurly() does need to be E so that 'use re debug' works; this is
something I forgot to mention was an additional change from the patch
submitted a while back (and the only one not mentioned).

But this brings up a question. My patch to the \X regex handling added
several functions. These ended up part of the public API, which I would
rather they not be. I believe they should be made E instead.

But I haven't mentioned it before because I thought it was too late to
change for 5.12. Is that correct? It would be better to not put
something in the public API that doesn't belong there.

@p5pRT

p5pRT commented Feb 19, 2010

From @rgarcia

On 19 February 2010 16:40, karl williamson <public@khwilliamson.com> wrote:

Nicholas Clark wrote:

On Fri, Feb 19, 2010 at 10:35:19AM +0000, Nicholas Clark wrote:

On Fri, Feb 19, 2010 at 11:30:04AM +0100, Rafael Garcia-Suarez wrote:

The fix was trivial, see

http://perl5.git.perl.org/perl.git/commitdiff/f5a573297aad004c6761b844d65a3e6d8402cd50

Thanks, both patches applied to bleadperl. Just in time for the next
dev release!

Does regcurly really need to be part of the public API?
ie instead of 'P', can it be 'p', non-static, but private.

Rafael pointed out on IRC that I'm confusing 'A' and 'P'.
It's not 'A'.*

The confusion arose because I mis-interpreted
"make public the function regcurly()"

Nicholas Clark

regcurly() does need to be E so that 'use re debug' works; this is something
I forgot to mention was an additional change from the patch submitted a
while back (and the only one not mentioned).

But this brings up a question.  My patch to the \X regex handling added
several functions.  These ended up part of the public API, which I would
rather they not be.  I believe they should be made E instead.

But I haven't mentioned it before because I thought it was too late to
change for 5.12.  Is that correct?  It would be better to not put something
in the public API that doesn't belong there.

No, I don't think it's too late.

By the way, this issue came up on IRC today :

m/^\N {3}/x does not behave the same as m/^\N{3}/, although that
works for \S. I believe that should be corrected (just having no time
to look at that right now)

@p5pRT

p5pRT commented Feb 19, 2010

From @khwilliamson

Rafael Garcia-Suarez wrote:

On 19 February 2010 16:40, karl williamson <public@khwilliamson.com> wrote:

Nicholas Clark wrote:

On Fri, Feb 19, 2010 at 10:35:19AM +0000, Nicholas Clark wrote:

On Fri, Feb 19, 2010 at 11:30:04AM +0100, Rafael Garcia-Suarez wrote:

The fix was trivial, see

http://perl5.git.perl.org/perl.git/commitdiff/f5a573297aad004c6761b844d65a3e6d8402cd50

Thanks, both patches applied to bleadperl. Just in time for the next
dev release!
Does regcurly really need to be part of the public API?
ie instead of 'P', can it be 'p', non-static, but private.
Rafael pointed out on IRC that I'm confusing 'A' and 'P'.
It's not 'A'.*

The confusion arose because I mis-interpreted
"make public the function regcurly()"

Nicholas Clark

regcurly() does need to be E so that 'use re debug' works; this is something
I forgot to mention was an additional change from the patch submitted a
while back (and the only one not mentioned).

But this brings up a question. My patch to the \X regex handling added
several functions. These ended up part of the public API, which I would
rather they not be. I believe they should be made E instead.

But I haven't mentioned it before because I thought it was too late to
change for 5.12. Is that correct? It would be better to not put something
in the public API that doesn't belong there.

No, I don't think it's too late.

By the way, this issue came up on IRC today :

m/^\N {3}/x does not behave the same as m/^\N{3}/, although that
works for \S. I believe that should be corrected (just having no time
to look at that right now)

I had thought of this, but thought that the pre-existing behavior of \N
meaning [^\n] must be correct, so I didn't change it. It's a trivial
change. Note as well that before the introduction of the [^\n] meaning,
no separation was allowed between \N and '{', even under /x (you get a
missing braces message); this is consistent with the \x{...} construct.
  The brace must immediately follow the x.

My proposed change would allow space between the N and the brace for
both meanings under /x. But you would get an error message if the
character name wasn't of the form U+...

This is in keeping with a patch I've been preparing. Yesterday it
started dawning on me that the regex compiler needs to do something
different than what my submitted patch does for the case of a
single-quotish thing getting passed in. When I was writing it, I did
not realize that it was possible to bypass the tokenizer parsing, and so
what regcomp.c does is assume that the input has been parsed, and so if
there is a \N not of the form \N{U+...}, it assumes it means [^\n], and
doesn't check. Instead, I'll check and if necessary, raise an error.
This is in keeping with what 5.8 does for single quotish input.
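
The decision described above can be modelled in a few lines of standalone C. This is a simplified sketch rather than the regcomp.c code: `classify_backslash_N`, `looks_like_quantifier` and the enum are hypothetical names, the input is assumed to be the pattern text immediately after the `\N`, and /x whitespace handling is ignored here.

```c
#include <ctype.h>
#include <string.h>

enum n_meaning { N_NON_NEWLINE, N_NAMED_SEQUENCE, N_UNRESOLVED_NAME };

/* Returns 1 if s starts with a {n}, {n,} or {n,m} quantifier. */
static int looks_like_quantifier(const char *s)
{
    if (*s++ != '{') return 0;
    if (!isdigit((unsigned char)*s)) return 0;
    while (isdigit((unsigned char)*s)) s++;
    if (*s == ',') {
        s++;
        while (isdigit((unsigned char)*s)) s++;
    }
    return *s == '}';
}

/* rest points just past the "\N" in the pattern text. */
enum n_meaning classify_backslash_N(const char *rest)
{
    const char *endbrace;

    /* No brace, or a quantifier: \N means [^\n]. */
    if (*rest != '{' || looks_like_quantifier(rest))
        return N_NON_NEWLINE;

    rest++;                                 /* skip the '{' */
    endbrace = strchr(rest, '}');

    /* Only \N{} and \N{U+...} are acceptable to the regex compiler. */
    if (endbrace && (endbrace == rest || strncmp(rest, "U+", 2) == 0))
        return N_NAMED_SEQUENCE;

    /* Anything else is a name the lexer never got to resolve. */
    return N_UNRESOLVED_NAME;
}
```

Under this model `{3,4}` is a quantifier on `[^\n]`, `{U+41}` is a resolved character, and `{SPACE}` arriving at the regex compiler is an error, since only the lexer knows how to look names up.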

@p5pRT

p5pRT commented Feb 19, 2010

From @khwilliamson

Attached is a patch to the recent #56444 patch to add the /x handling
that Rafael mentioned today, as well as to improve the handling of
strings that somehow bypass the lexer.

I decided not to do quite what I said in an earlier post today. I think
it is better to not allow spaces between the \N and { for the named
character case even under the /x modifier. This makes it consistent with
the other similar constructs, such as \x{...}.
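
A rough model of that rule, as a standalone C sketch under stated assumptions: `classify_under_x`, `is_quantifier` and the enum are hypothetical names, only whitespace is skipped (the real code would also have to skip /x comments), and the input is the pattern text immediately after the `\N`.

```c
#include <ctype.h>

enum x_result { X_NON_NEWLINE, X_NAMED_SEQUENCE_FOLLOWS, X_MISSING_BRACES };

/* Returns 1 if s starts with a {n}, {n,} or {n,m} quantifier. */
static int is_quantifier(const char *s)
{
    if (*s++ != '{') return 0;
    if (!isdigit((unsigned char)*s)) return 0;
    while (isdigit((unsigned char)*s)) s++;
    if (*s == ',') {
        s++;
        while (isdigit((unsigned char)*s)) s++;
    }
    return *s == '}';
}

/* rest points just past "\N"; extended is nonzero under /x. */
enum x_result classify_under_x(const char *rest, int extended)
{
    const char *p = rest;

    /* Under /x, look past whitespace to find the next real character. */
    if (extended)
        while (isspace((unsigned char)*p)) p++;

    /* The [^\n] meaning tolerates the skipped whitespace. */
    if (*p != '{' || is_quantifier(p))
        return X_NON_NEWLINE;

    /* The named-character meaning needs the brace right after \N. */
    if (*rest != '{')
        return X_MISSING_BRACES;

    return X_NAMED_SEQUENCE_FOLLOWS;
}
```

So `\N {3}` under /x still means three non-newlines, while `\N {SPACE}` and `\N {U+41}` under /x are rejected.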

I'm not too happy with the wording of the error message "\N{NAME} must
be resolved by the lexer". Feel free to improve it. I used the term
'regex' not 'regexp' in perldiag.pod. I remember someone wanting to
standardize on the latter term somewhere, but perldiag.pod has both, so
I'm not sure what should be done. I prefer the former because it's
easier for me to pronounce (I was a radio announcer in college.)

@p5pRT

p5pRT commented Feb 19, 2010

From @khwilliamson

0001-Improve-handling-of-qq-N-.-and-x.patch
From 2aeb25e38176af47d4f97c0ac9845b5257800a09 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@khw-desktop.(none)>
Date: Fri, 19 Feb 2010 14:42:16 -0700
Subject: [PATCH] Improve handling of qq(\N{...}); and /x

It is possible to bypass the lexer's parsing of \N.  This patch causes
the regex compiler to deal with that better.  The compiler no longer
assumes that the lexer parsed the \N.  It generates an error message if
the \N isn't in a form it is expecting, and invalid hexadecimal digits
are now fatal errors, with the position of the error more clearly
marked.

The diagnostic pod has been updated to reflect the new error messages,
with some slight clarifications to the previous ones as well.
---
 pod/perldiag.pod |   54 +++++++++++++++++++++++++++++++++--------
 regcomp.c        |   70 ++++++++++++++++++++++++++++++++++++++++-------------
 t/re/re_tests    |   31 +++++++++++++++++++++++-
 3 files changed, 126 insertions(+), 29 deletions(-)

diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index 4a12889..95b45f7 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -1912,7 +1912,7 @@ about 250 characters for simple names, and somewhat more for compound
 names (like C<$A::B>).  You've exceeded Perl's limits.  Future versions
 of Perl are likely to eliminate these arbitrary limitations.
 
-=item Ignoring zero length \N{} in character class"
+=item Ignoring zero length \N{} in character class
 
 (W) Named Unicode character escapes (\N{...}) may return a
 zero length sequence.  When such an escape is used in a character class
@@ -2474,7 +2474,10 @@ immediately after the switch, without intervening spaces.
 =item Missing braces on \N{}
 
 (F) Wrong syntax of character name literal C<\N{charname}> within
-double-quotish context.
+double-quotish context.  This can also happen when there is a space (or
+comment) between the C<\N> and the C<{> in a regex with the C</x> modifier.
+This modifier does not change the requirement that the brace immediately follow
+the C<\N>.
 
 =item Missing comma after first argument to %s function
 
@@ -2524,18 +2527,18 @@ double-quoted strings and regular expression patterns.  In patterns, it doesn't
 have the meaning an unescaped C<*> does.
 
 Starting in Perl 5.12.0, C<\N> also can have an additional meaning (only) in
-patterns, namely to match a non-newline character.  (This is like C<.> but is
-not affected by the C</s> modifier.)
+patterns, namely to match a non-newline character.  (This is short for
+C<[^\n]>, and like C<.> but is not affected by the C</s> regex modifier.)
 
 This can lead to some ambiguities.  When C<\N> is not followed immediately by a
-left brace, Perl assumes the "match non-newline character" meaning.  Also, if
+left brace, Perl assumes the C<[^\n]> meaning.  Also, if
 the braces form a valid quantifier such as C<\N{3}> or C<\N{5,}>, Perl assumes
 that this means to match the given quantity of non-newlines (in these examples,
 3; and 5 or more, respectively).  In all other case, where there is a C<\N{>
 and a matching C<}>, Perl assumes that a character name is desired.
 
 However, if there is no matching C<}>, Perl doesn't know if it was mistakenly
-omitted, or if "match non-newline" followed by "match a C<{>" was desired, and
+omitted, or if C<[^\n]{> was desired, and
 raises this error.  If you meant the former, add the right brace; if you meant
 the latter, escape the brace with a backslash, like so: C<\N\{>
 
@@ -2626,10 +2629,38 @@ local() if you want to localize a package variable.
 
 =item \\N in a character class must be a named character: \\N{...}
 
-The new (5.12) meaning of C<\N> to match non-newlines is not valid in a
-bracketed character class, for the same reason that C<.> in a character class
-loses its specialness: it matches almost everything, which is probably not what
-you want.
+(F) The new (5.12) meaning of C<\N> as C<[^\n]> is not valid in a bracketed
+character class, for the same reason that C<.> in a character class loses its
+specialness: it matches almost everything, which is probably not what you want.
+
+=item \\N{NAME} must be resolved by the lexer
+
+(F) When compiling a regex pattern, an unresolved named character or sequence
+was encountered.  This can happen in any of several ways that bypass the lexer,
+such as using single-quotish context:
+
+    $re = '\N{SPACE}';	# Wrong!
+    /$re/;
+
+Instead, use double-quotes:
+
+    $re = "\N{SPACE}";	# ok
+    /$re/;
+
+The lexer can be bypassed as well by creating the pattern from smaller
+components:
+
+    $re = '\N';
+    /${re}{SPACE}/;	# Wrong!
+
+It's not a good idea to split a construct in the middle like this, and it
+doesn't work here.  Instead use the solution above.
+
+Finally, the message also can happen under the C</x> regex modifier when the
+C<\N> is separated by spaces from the C<{>, in which case, remove the spaces.
+
+    /\N {SPACE}/x;	# Wrong!
+    /\N{SPACE}/x;	# ok
 
 =item Name "%s::%s" used only once: possible typo
 
@@ -2646,7 +2677,8 @@ will not trigger this warning.
 =item Invalid hexadecimal number in \\N{U+...}
 
 (F) The character constant represented by C<...> is not a valid hexadecimal
-number.
+number.  Either it is empty, or you tried to use a character other than 0 - 9
+or A - F, a - f in a hexadecimal number.
 
 =item Negative '/' count in unpack
 
diff --git a/regcomp.c b/regcomp.c
index ce4104a..b5c685c 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -6625,26 +6625,29 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
 STATIC regnode *
 S_reg_namedseq(pTHX_ RExC_state_t *pRExC_state, UV *valuep, I32 *flagp)
 {
-    char * endbrace;    /* endbrace following the name */
+    char * endbrace;    /* '}' following the name */
     regnode *ret = NULL;
 #ifdef DEBUGGING
     char* parse_start = RExC_parse - 2;	    /* points to the '\N' */
 #endif
+    char* p;
 
     GET_RE_DEBUG_FLAGS_DECL;
  
     PERL_ARGS_ASSERT_REG_NAMEDSEQ;
 
     GET_RE_DEBUG_FLAGS;
+
+    /* The [^\n] meaning of \N ignores spaces and comments under the /x
+     * modifier.  The other meaning does not */
+    p = (RExC_flags & RXf_PMf_EXTENDED)
+	? regwhite( pRExC_state, RExC_parse )
+	: RExC_parse;
    
     /* Disambiguate between \N meaning a named character versus \N meaning
-     * don't match a newline. */
-    if (*RExC_parse != '{'
-	|| (! (endbrace = strchr(RExC_parse, '}'))) /* no trailing brace */
-	|| ! (endbrace == RExC_parse + 1	/* nothing between the {} */
-	      || (endbrace - RExC_parse > 3	/* U+ and at least one hex */
-		  && strnEQ(RExC_parse + 1, "U+", 2))))
-    {
+     * [^\n].  The former is assumed when it can't be the latter. */
+    if (*p != '{' || regcurly(p)) {
+	RExC_parse = p;
 	if (valuep) {
 	    /* no bare \N in a charclass */
 	    vFAIL("\\N in a character class must be a named character: \\N{...}");
@@ -6658,8 +6661,27 @@ S_reg_namedseq(pTHX_ RExC_state_t *pRExC_state, UV *valuep, I32 *flagp)
 	return ret;
     }
 
-    /* Here, we have decided it is a named sequence */
+    /* Here, we have decided it should be a named sequence */
+
+    /* The test above made sure that the next real character is a '{', but
+     * under the /x modifier, it could be separated by space (or a comment and
+     * \n) and this is not allowed (for consistency with \x{...} and the
+     * tokenizer handling of \N{NAME}). */
+    if (*RExC_parse != '{') {
+	vFAIL("Missing braces on \\N{}");
+    }
+
     RExC_parse++;	/* Skip past the '{' */
+
+    if (! (endbrace = strchr(RExC_parse, '}')) /* no trailing brace */
+	|| ! (endbrace == RExC_parse		/* nothing between the {} */
+	      || (endbrace - RExC_parse >= 2	/* U+ (bad hex is checked below */
+		  && strnEQ(RExC_parse, "U+", 2)))) /* for a better error msg) */
+    {
+	if (endbrace) RExC_parse = endbrace;	/* position msg's '<--HERE' */
+	vFAIL("\\N{NAME} must be resolved by the lexer");
+    }
+
     if (endbrace == RExC_parse) {   /* empty: \N{} */
 	if (! valuep) {
 	    RExC_parse = endbrace + 1;  
@@ -6703,8 +6725,16 @@ S_reg_namedseq(pTHX_ RExC_state_t *pRExC_state, UV *valuep, I32 *flagp)
 
 	/* The tokenizer should have guaranteed validity, but it's possible to
 	 * bypass it by using single quoting, so check */
-	if ( length_of_hex != (STRLEN)(endchar - RExC_parse) ) {
-	    *valuep = UNICODE_REPLACEMENT;
+	if (length_of_hex == 0
+	    || length_of_hex != (STRLEN)(endchar - RExC_parse) )
+	{
+	    RExC_parse += length_of_hex;	/* Includes all the valid */
+	    RExC_parse += (RExC_orig_utf8)	/* point to after 1st invalid */
+			    ? UTF8SKIP(RExC_parse)
+			    : 1;
+	    /* Guard against malformed utf8 */
+	    if (RExC_parse >= endchar) RExC_parse = endchar;
+	    vFAIL("Invalid hexadecimal number in \\N{U+...}");
 	}    
 
 	RExC_parse = endbrace + 1;
@@ -6731,7 +6761,7 @@ S_reg_namedseq(pTHX_ RExC_state_t *pRExC_state, UV *valuep, I32 *flagp)
 	 * is primarily a named character, and not intended to be a huge long
 	 * string, so 255 bytes should be good enough */
 	while (1) {
-	    STRLEN this_char_length;
+	    STRLEN length_of_hex;
 	    I32 grok_flags = PERL_SCAN_ALLOW_UNDERSCORES
 			    | PERL_SCAN_DISALLOW_PREFIX
 			    | (SIZE_ONLY ? PERL_SCAN_SILENT_ILLDIGIT : 0);
@@ -6743,12 +6773,18 @@ S_reg_namedseq(pTHX_ RExC_state_t *pRExC_state, UV *valuep, I32 *flagp)
 	    if (! endchar) endchar = endbrace;
 
 	    /* The values are Unicode even on EBCDIC machines */
-	    this_char_length = (STRLEN)(endchar - RExC_parse);
-	    cp = grok_hex(RExC_parse, &this_char_length, &grok_flags, NULL);
-	    if ( this_char_length == 0 
-		|| this_char_length != (STRLEN)(endchar - RExC_parse) )
+	    length_of_hex = (STRLEN)(endchar - RExC_parse);
+	    cp = grok_hex(RExC_parse, &length_of_hex, &grok_flags, NULL);
+	    if ( length_of_hex == 0 
+		|| length_of_hex != (STRLEN)(endchar - RExC_parse) )
 	    {
-		cp = UNICODE_REPLACEMENT;   /* Substitute a valid character */
+		RExC_parse += length_of_hex;	    /* Includes all the valid */
+		RExC_parse += (RExC_orig_utf8)	/* point to after 1st invalid */
+				? UTF8SKIP(RExC_parse)
+				: 1;
+		/* Guard against malformed utf8 */
+		if (RExC_parse >= endchar) RExC_parse = endchar;
+		vFAIL("Invalid hexadecimal number in \\N{U+...}");
 	    }    
 
 	    if (! FOLD) {	/* Not folding, just append to the string */
diff --git a/t/re/re_tests b/t/re/re_tests
index 6304fe6..1807ffc 100644
--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -34,9 +34,15 @@ ab*bc	abbbbc	y	$+[0]	6
 \N{1}	abbbbc	y	$&	a
 \N{1}	abbbbc	y	$-[0]	0
 \N{1}	abbbbc	y	$+[0]	1
+/\N {1}/x	abbbbc	y	$&	a
+/\N {1}/x	abbbbc	y	$-[0]	0
+/\N {1}/x	abbbbc	y	$+[0]	1
 \N{3,4}	abbbbc	y	$&	abbb
 \N{3,4}	abbbbc	y	$-[0]	0
 \N{3,4}	abbbbc	y	$+[0]	4
+/\N {3,4}/x	abbbbc	y	$&	abbb
+/\N {3,4}/x	abbbbc	y	$-[0]	0
+/\N {3,4}/x	abbbbc	y	$+[0]	4
 ab{0,}bc	abbbbc	y	$&	abbbbc
 ab{0,}bc	abbbbc	y	$-[0]	0
 ab{0,}bc	abbbbc	y	$+[0]	6
@@ -76,10 +82,13 @@ $	abc	y	$&
 a.c	abc	y	$&	abc
 a.c	axc	y	$&	axc
 a\Nc	abc	y	$&	abc
+/a\N c/x	abc	y	$&	abc
 a.*c	axyzc	y	$&	axyzc
 a\N*c	axyzc	y	$&	axyzc
+/a\N *c/x	axyzc	y	$&	axyzc
 a.*c	axyzd	n	-	-
 a\N*c	axyzd	n	-	-
+/a\N *c/x	axyzd	n	-	-
 a[bc]d	abc	n	-	-
 a[bc]d	abd	y	$&	abd
 a[b]d	abd	y	$&	abd
@@ -1412,12 +1421,32 @@ abc\N	abcd	y	$&	abcd
 abc\N	abc\n	n		
 
 # Verify get errors.  For these, we need // or else puts it in single quotes,
-# and doesn't expand.
+# and bypasses the lexer.
 /\N{U+}/	-	c	-	Invalid hexadecimal number
+# Below currently gives a misleading message
+/[\N{U+}]/	-	c	-	Unmatched
 /abc\N{def/	-	c	-	Missing right brace
+/\N{U+4AG3}/	-	c	-	Illegal hexadecimal digit
+/[\N{U+4AG3}]/	-	c	-	Illegal hexadecimal digit
+
+# And verify that in single quotes which bypasses the lexer, the regex compiler
+# figures it out.
+\N{U+}	-	c	-	Invalid hexadecimal number
+[\N{U+}]	-	c	-	Invalid hexadecimal number
+\N{U+4AG3}	-	c	-	Invalid hexadecimal number
+[\N{U+4AG3}]	-	c	-	Invalid hexadecimal number
+abc\N{def	-	c	-	\\N{NAME} must be resolved by the lexer
+
+# Verify that under /x that still cant have space before left brace
+/abc\N {U+41}/x	-	c	-	Missing braces
+/abc\N {SPACE}/x	-	c	-	Missing braces
 
 # Verifies catches hex errors, and doesn't expose our . notation to the outside
 /\N{U+0xBEEF}/	- 	c	-	Illegal hexadecimal digit
 /\N{U+BEEF.BEAD}/	- 	c	-	Illegal hexadecimal digit
 
+# Verify works in single quotish context; regex compiler delivers slightly different msg
+# \N{U+BEEF.BEAD} succeeds here, because can't completely hide it from the outside.
+\N{U+0xBEEF}	- 	c	-	Invalid hexadecimal number
+
 # vim: set noexpandtab
-- 
1.5.6.3
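
The way the patch above positions the error marker for a bad \N{U+...} can be sketched in standalone C. This is an illustrative model only: `hex_error_offset` is a hypothetical name, the text is assumed to be single-byte (no UTF-8 skipping), and underscores, which the real hex parser is told to allow, are treated as invalid here.

```c
#include <ctype.h>
#include <stddef.h>

/* s[0..len) is the text between "U+" and the end of the code point.
 * Returns -1 if it is a non-empty run of hex digits; otherwise returns the
 * offset where an error marker would go: just past the first invalid
 * character, clamped to the end of the text. */
long hex_error_offset(const char *s, size_t len)
{
    size_t valid = 0;
    size_t pos;

    /* Count the leading valid hex digits. */
    while (valid < len && isxdigit((unsigned char)s[valid]))
        valid++;

    if (valid > 0 && valid == len)
        return -1;              /* all digits, nothing to report */

    pos = valid + 1;            /* step over the first invalid character */
    if (pos > len)
        pos = len;              /* empty input: stay at the end */
    return (long)pos;
}
```

For `4AG3` the marker lands after the `G`; for an empty code point it stays at offset 0.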

@p5pRT

p5pRT commented Feb 20, 2010

From @steve-m-hay

On 19 February 2010 21:57, karl williamson <public@khwilliamson.com> wrote:

Attached is a patch to the recent #56444 patch to add the /x handling that
Rafael mentioned today, as well as to improve the handling of strings that
somehow bypass the lexer.

Thanks, applied as c3c4140.

@p5pRT

p5pRT commented Oct 5, 2014

From @khwilliamson

Adding just for the record:

Some of the problems here turn out to be flaws in the charnames handler that were later fixed; but there still would be problems when a regex gets interpolated into another regex outside the original scope, so something like what is done here is required.

--
Karl Williamson

