Report information
Id: 58182
Status: resolved
Priority: 0
Queue: perl5

Owner: Nobody
Requestors: contact [at] khwilliamson.com
Cc:
AdminCc:

Operating System: Linux
PatchStatus: (no value)
Severity: High
Type:
Perl Version: 5.10.0
Fixed In: (no value)

Attachments
0001-Change-to-use-mnemonic-instead-of-char-constant.patch
0001-handy.h-Add-isSPACE_L1-with-Unicode-semantics.patch
0001-handy.h-Add-isSPACEU-with-Unicode-semantics.patch
0001-PATCH-perl-58182-partial.-user-defined-casing.patch
0001-regcomp.c-Remove-duplicate-statement.patch
0001-regex-case-sensitive-match-utf8ness-independent.patch
0001-Typo.patch
0002-Change-comments-documentation-for-new.patch
0002-mktables-Add-caution-comments-to-output-tables.patch
0002-perlrebackslash-Fix-grammatical-error.patch
0002-perlrebackslash-Fix-poor-grammar.patch
0002-re-reg_fold.t-use-array-size-for-test-counts.patch
0002-Use-sizeof-instead-of-hard-coded-array-size.patch
0003-Add-tested-for-corrupted-regnode.patch
0003-Change-.t-to-use-new.patch
0003-regcomp.c-Fix-white-space-cuddled-else.patch
0003-regcomp.c-rmv-trail-blanks-uncuddle-else.patch
0003-regcomp.c-typo-in-comment.patch
0004-Allocate-bit-for-u-modifier.patch
0004-Display-characters-as-Unicode-for-clarity.patch
0004-regcomp.c-teach-tries-about-EXACTFU.patch
0004-regexec.c-add-and-refactor-macros.patch
0004-Subject-PATCH-regexec.c-add-and-refactor-macros.patch
0005-Clarify-that-count-is-bytes-not-unicode-characters.patch
0005-re.pm-Change-comment-to-use-new.patch
0005-regcomp.c-utf8-pattern-defaults-to-Unicode-semantic.patch
0005-regexec.c-make-macro-lines-fit-in-80-cols.patch
0005-Subject-PATCH-regexec.c-make-macros-fit-80-cols.patch
0006-Add-d-l-u-infixed-regex-modifiers.patch
0006-Add-Unicode-semantics-to-regex-case-sensitive-matchi.patch
0006-regcomp.c-Use-latin1-folding-in-synthetic-start-cla.patch
0006-regcomp.h-Add-macro-to-retrieve-regnode-flags.patch
0006-Subject-regcomp.h-Add-macro-to-get-regnode-flags.patch
0007-Add-l-t-u-regex-modifiers.patch
0007-feature-unicode_strings.t-rmv-trail-blank.patch
0007-regcomp.c-Convert-some-things-to-use-cBOOL.patch
0007-regcomp.sym-update-comment.patch
0007-Subject-unicode_strings.t-rmv-trail-blanks.patch
0008-Change-Test-Simple-.t-fix.patch
0008-lib-feature-unicode_strings.t-Imprv-test-output.patch
0008-regcomp.sym-Add-REFFU-and-NREFFU-nodes.patch
0008-Subject-unicode_strings.t-Imprv-test-output.patch
0009-perlre.pod-slight-rewording.patch
0009-re-fold_grind.t-Refactor-to-test-utf8-patterns.patch
0009-regcomp.c-convert-to-use-cBOOL.patch
0010--perl-58182-partial-Add-unicode-s-w-matching.patch
0010-regexec.c-Handle-REFFU-and-NREFFU-refactor.patch
0010-Subject-handy.h-Add-isWORDCHAR_L1-macro.patch
0011-regcomp.c-Generate-REFFU-and-NREFFU.patch
0011-Subject-perl-58182-partial-Add-uni-s-w-matchi.patch
0012-re-fold_grind.t-Add-tests-for-NREFFU-REFFU.patch
0013-Nit-in-perlunicode.pod.patch
0014-Document-Unicode-doc-fix.patch
0015-Nit-in-perlre.pod.patch
0016-Nit-in-perlunicode.pod.patch
0017-Nit-in-perluniintro.pod.patch
patch
signature.asc



Subject: Inconsistent and wrong handling of 8th bit set chars with no locale
Date: Wed, 20 Aug 2008 16:27:57 -0600
To: perlbug [...] perl.org
From: karl williamson <contact [...] khwilliamson.com>
This is a bug report for perl from corporate@khwilliamson.com, generated with the help of perlbug 1.36 running under perl 5.10.0.
-----------------------------------------------------------------
Characters in the range U+0080 through U+00FF behave inconsistently depending on whether or not they are part of a string which also includes a character above that range, and in some cases they behave incorrectly even when part of such a string. The problems I will concentrate on in this report are those involving case. I presume that they do work properly when a locale is set, but I haven't tested that.

    print uc("\x{e0}"), "\n";    # (a with grave accent)

yields itself instead of a capital A with grave accent (U+00C0). This is true whether or not the character is part of a string which includes a character not storable in a single byte. Similarly,

    print "\x{e0}" =~ /\x{c0}/i, "\n";

will print a null string on a line, as the match fails. The same behavior occurs for all characters in this range that are marked in the Unicode standard as lower case and have single-letter upper case equivalents.

The inconsistent behavior mostly occurs with upper case letters being mapped to lower case.

    print lcfirst("\x{c0}aaaaa"), "\n";

doesn't change the first character. But

    print lcfirst("\x{c0}aaaaa\x{101}"), "\n";

does change it. There is something seriously wrong when a character separated by an arbitrarily large distance from another one can affect what case the latter is considered to be. Similarly,

    print "\x{c0}aaaaaa" =~ /^\x{e0}/i, "\n";

will show the match failing, but

    print "\x{c0}aaaaaa\x{101}" =~ /^\x{e0}/i, "\n";

will show the match succeeding. Again, a character perhaps hundreds of positions further along in a string can affect whether the first character in that string matches its lower case equivalent when case is ignored. The same behavior occurs for all characters in this range that are marked in the Unicode standard as upper case and have lower case equivalents, as well as U+00DF, which is lower case and whose upper case equivalent is the string 'SS'.

Also, the byte character classes match characters in this range inconsistently, again depending on whether or not the character is part of a larger string that contains a character greater than the range. So, for example, for a non-breaking space,

    print "\xa0" =~ /^\s/, "\n";

will show that the match returns false, but

    print "\xa0\x{101}" =~ /^\s/, "\n";

will show that the match returns true. This behavior is sort-of documented, and there is a work-around, which is to use the '\p{}' classes instead. Note that calling them byte character classes is wrong; they really are 7-bit classes.

From reading the documentation, I presume that the inconsistent behavior is a result of the decision to have perl not switch to wide-character mode in storing its strings unless necessary. I like that decision for efficiency reasons. But what has happened is that the code points in the range 128-255 have been orphaned when they aren't part of strings that force the switch. Again, I presume, but haven't tested, that using a locale causes them to work properly for that locale; in the absence of a locale they should be treated as Unicode code points (or, equivalently for characters in this range, as iso-8859-1). Storing as wide characters is supposed to be transparent to users, but this bug belies that and yields very inconsistent and unexpected behavior.
(This doesn't explain the lower-to-upper-case translation bug, which is wrong even in wide-character mode.)

I am frankly astonished that this bug exists, as I have come to expect perl to "Do the Right Thing" over the course of many years of using it. I did see one bug report of something similar to this when searching for this, but it apparently was misunderstood and went nowhere, and wasn't in the perl bug database.
-----------------------------------------------------------------
--- Flags: category=core severity=high
--- Site configuration information for perl 5.10.0:
Configured by ActiveState at Wed May 14 05:06:16 PDT 2008.
Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform: osname=linux, osvers=2.4.21-297-default, archname=i686-linux-thread-multi uname='linux gila 2.4.21-297-default #1 sat jul 23 07:47:39 utc 2005 i686 i686 i386 gnulinux ' config_args='-ders -Dcc=gcc -Dusethreads -Duseithreads -Ud_sigsetjmp -Uinstallusrbinperl -Ulocincpth= -Uloclibpth= -Accflags=-DUSE_SITECUSTOMIZE -Duselargefiles -Accflags=-DPRIVLIB_LAST_IN_INC -Dprefix=/opt/ActivePerl-5.10 -Dprivlib=/opt/ActivePerl-5.10/lib -Darchlib=/opt/ActivePerl-5.10/lib -Dsiteprefix=/opt/ActivePerl-5.10/site -Dsitelib=/opt/ActivePerl-5.10/site/lib -Dsitearch=/opt/ActivePerl-5.10/site/lib -Dsed=/bin/sed -Duseshrplib -Dcf_by=ActiveState -Dcf_email=support@ActiveState.com' hint=recommended, useposix=true, d_sigaction=define useithreads=define, usemultiplicity=define useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=undef, use64bitall=undef, uselongdouble=undef usemymalloc=n, bincompat5005=undef
  Compiler: cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_INC -fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O2', cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_INC -fno-strict-aliasing -pipe' ccversion='', gccversion='3.3.1 (SuSE Linux)', gccosandvers='' intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12 ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=4, prototype=define
  Linker and Libraries: ld='gcc', ldflags ='' libpth=/lib /usr/lib /usr/local/lib libs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc libc=, so=so, useshrplib=true, libperl=libperl.so gnulibc_version='2.3.2'
  Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/opt/ActivePerl-5.10/lib/CORE' cccdlflags='-fPIC', lddlflags='-shared -O2'
Locally applied patches:
    ACTIVEPERL_LOCAL_PATCHES_ENTRY
    33741 avoids segfaults invoking S_raise_signal() (on Linux)
    33763 Win32 process ids can have more than 16 bits
    32809 Load 'loadable object' with non-default file extension
    32728 64-bit fix for Time::Local
--- @INC for perl 5.10.0:
    /opt/ActivePerl-5.10/site/lib
    /opt/ActivePerl-5.10/lib
    .
--- Environment for perl 5.10.0:
    HOME=/home/khw
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/opt/ActivePerl-5.10/bin:/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/home/khw/cxoffice/bin
    PERL_BADLANG (unset)
    SHELL=/bin/ksh
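The examples above, collected into one short script for convenience (my addition, not part of the original report). The commented results are the behavior described here for perl 5.10 with no locale in effect; later perls change this under the "unicode_strings" feature.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Case changing ignores the 0x80-0xFF range when the string is byte-encoded:
    printf "uc(\\x{e0}) is U+%04X\n", ord uc "\x{e0}";         # U+00E0, not U+00C0
    print  "\x{e0}" =~ /\x{c0}/i ? "match\n" : "no match\n";   # no match

    # The same character behaves differently once a character above 0xFF forces
    # the string into Perl's internal UTF-8 ("wide character") representation:
    printf "U+%04X\n", ord lcfirst "\x{c0}aaaaa";           # U+00C0 (unchanged)
    printf "U+%04X\n", ord lcfirst "\x{c0}aaaaa\x{101}";    # U+00E0 (lowercased)

    # The character classes show the same split for NO-BREAK SPACE:
    print "\xa0"        =~ /^\s/ ? "1\n" : "0\n";   # 0
    print "\xa0\x{101}" =~ /^\s/ ? "1\n" : "0\n";   # 1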
Subject: Re: [perl #58182] Inconsistent and wrong handling of 8th bit set chars with no locale
Date: Thu, 21 Aug 2008 09:50:00 +0200
To: perl5-porters [...] perl.org
From: Moritz Lenz <moritz [...] casella.verplant.org>
karl williamson wrote:
> # New Ticket Created by karl williamson
> # Please include the string: [perl #58182]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=58182 >
>
> [...]
>
> print uc("\x{e0}"), "\n"; # (a with grave accent)
>
> yields itself instead of a capital A with grave accent (U+00C0). This is true whether or not the character is part of a string which includes a character not storable in a single byte. Similarly
This is a known bug, and probably not fixable, because too much code depends on it. See http://search.cpan.org/perldoc?Unicode::Semantics

A possible workaround is

    my $x = "\x{e0}";
    utf8::upgrade($x);
    say uc($x); # yields À

Cheers,
Moritz
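A slightly expanded sketch of that workaround (my own illustration, not from the thread): utf8::upgrade() changes only the internal representation, after which the Unicode case mapping applies.

    use strict;
    use warnings;

    my $x = "\x{e0}";                       # LATIN SMALL LETTER A WITH GRAVE, one byte
    printf "before: U+%04X\n", ord uc $x;   # U+00E0 -- uc() leaves it alone

    utf8::upgrade($x);                      # same character, UTF-8 representation
    printf "after:  U+%04X\n", ord uc $x;   # U+00C0 -- Unicode casing now applies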
Subject: Re: [perl #58182] Inconsistent and wrong handling of 8th bit set chars with no locale
Date: Thu, 21 Aug 2008 13:22:36 +0200
To: perl5-porters [...] perl.org
From: "Dr.Ruud" <rvtol+news [...] isolution.nl>
karl williamson schreef:
> The behavior that is inconsistent mostly occurs with upper case letters being mapped to lower case.
>
> print lcfirst("\x{c0}aaaaa"), "\n";
>
> doesn't change the first character. But
>
> print lcfirst("\x{c0}aaaaa\x{101}"), "\n";
>
> does change it.

To me that is as expected.

    print lcfirst substr "\x{100}\x{c0}aaaaa", 1;

Lowercasing isn't defined for as many characters in ASCII or Latin-1 as it is in Unicode. Unicode semantics get activated when a codepoint above 255 is involved.

--
Affijn, Ruud

"Gewoon is een tijger."
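Spelling out Ruud's one-liner (my annotation, not part of his message): the \x{100} forces the literal into Perl's internal UTF-8 representation, substr() keeps that representation, and lcfirst() then applies Unicode case rules to the leading \x{c0}.

    use strict;
    use warnings;

    # Take the tail of a string that was forced into UTF-8 by the \x{100}:
    my $s = substr "\x{100}\x{c0}aaaaa", 1;        # "\x{c0}aaaaa", UTF-8-flagged
    printf "U+%04X\n", ord lcfirst $s;             # U+00E0 -- Unicode rules apply

    # The same text written directly stays byte-encoded, and lcfirst() is a no-op:
    printf "U+%04X\n", ord lcfirst "\x{c0}aaaaa";  # U+00C0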
CC: perl5-porters [...] perl.org
Subject: Re: [perl #58182] Inconsistent and wrong handling of 8th bit set chars with no locale
Date: Thu, 21 Aug 2008 14:31:40 +0300
To: "Dr.Ruud" <rvtol+news [...] isolution.nl>
From: Yuval Kogman <nothingmuch [...] woobling.org>
On Thu, Aug 21, 2008 at 13:22:36 +0200, Dr.Ruud wrote:
> Unicode semantics get activated when a codepoint above 255 is involved.

Or a code point above 127 with use utf8 or use encoding

--
Yuval Kogman <nothingmuch@woobling.org>
http://nothingmuch.woobling.org  0xEBD27418
Subject: Re: [perl #58182] Inconsistent and wrong handling of 8th bit set chars with no locale
Date: Thu, 21 Aug 2008 14:34:18 +0300
To: "Dr.Ruud" <rvtol+news [...] isolution.nl>, perl5-porters [...] perl.org
From: Yuval Kogman <nothingmuch [...] woobling.org>
On Thu, Aug 21, 2008 at 14:31:40 +0300, Yuval Kogman wrote:
> Or a code point above 127 with use utf8 or use encoding

I should clarify that this applies only to string constants. A code point above 127 will be treated as Unicode if the string is properly marked as such, and the way to achieve that for string constants is 'use utf8'.

--
Yuval Kogman <nothingmuch@woobling.org>
http://nothingmuch.woobling.org  0xEBD27418
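A small sketch of what "properly marked" means for literals (my example; it assumes this script file itself is saved as UTF-8):

    use strict;
    use warnings;
    use utf8;    # tells perl this source file is UTF-8, so literals are decoded

    my $x = "à";                    # one character, U+00E0, UTF-8-flagged
    printf "U+%04X\n", ord uc $x;   # U+00C0 -- Unicode semantics apply

    # Without "use utf8" the same two source bytes would instead become the
    # two characters U+00C3 and U+00A0, and uc() would not touch them.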
Subject: Volunteer for fixing [perl #58182], the "Unicode" bug
Date: Sat, 20 Sep 2008 16:52:02 -0600
To: Perl5 Porters <perl5-porters [...] perl.org>
From: karl williamson <contact [...] khwilliamson.com>
I'm the person who submitted this bug report. I think this bug should be fixed in Perl 5, and I'm volunteering to do it. Towards that end, I downloaded the Perl 5.10 source and hacked up an experimental version that seems to fix it. And now I've joined this list to see how to proceed. I don't know the protocol involved, so I'll just jump in, and hopefully that will be all right.

To refresh your memory, the current implementation of perl on non-EBCDIC machines is problematic for characters in the range 128-255 when no locale is set. The slides from the talk "Working around *the* Unicode bug" during YAPC::Europe 2007 in Vienna, http://juerd.nl/files/slides/2007yapceu/unicodesemantics.html, give more cases of problems than were in my bug report.

The crux of the problem is that on non-EBCDIC machines, in the absence of locale, a character (or code point) has to be stored in utf8 in order to have meaningful semantics, except in pattern matching with \h, \H, \v and \V or any of the \p{} patterns. (This leads to an anomaly: the no-break space is considered to be horizontal space (\h), but not space (\s).) (The characters also always have the base semantics of having an ordinal number, and of being not-a-anything, meaning that they all pattern match \W, \D, \S, [[:^punct:]], etc.)

Perl stores characters as utf8 automatically if a string contains any code points above 255, and ascii code points behave correctly in either representation. That leaves a hole-in-the-doughnut of characters between 128 and 255 with behavior that varies depending on whether they are stored as utf8 or not. This is contrary, for example, to the Camel book: "character semantics are preserved at an abstract level regardless of representation" (p.403). (How they get stored depends on how they were input, or whether or not they are part of a longer string containing code points larger than 255, or whether they have been explicitly set by using utf8::upgrade or utf8::downgrade.)

I know of three areas where this leads to problems. The first is the pattern matching already alluded to. This is at least documented (though somewhat confusingly), and one can use the \p{} constructs to avoid the issue. The second is the case changing functions, like lcfirst() or \U in pattern substitutions. And the third is ignoring case in pattern matches. There may be others which I haven't looked for yet. I think, for example, that quotemeta() will escape all these characters, though I don't believe that this causes a real problem.

One response I got to my bug report was that a lot of code depends on things working the way they currently do. I'm wondering if that applies to all three of the areas, or just the first?

Also, from reading the perl source, it appears to me that EBCDIC machines may work differently (and more correctly, to my way of thinking) than Ascii-ish ones.

An idea I've had is to add a pragma like "use latin1", or maybe "use locale unicode", or something else, as a way of not breaking existing application code.

Anyway, I'm hoping to get some sort of fix in for this. In my experimental implementation (which currently doesn't change EBCDIC handling), it is mostly just a matter of extending the existing definitions of ascii semantics to include the 128..255 latin1 range. Code logic changes were required only in the uc and ucfirst functions (to accommodate 3 characters which require special handling), and in the regular expression compilation (to accommodate 2 characters which need special handling).

Obviously, in my ignorance, I may be missing things that others can enlighten me on. So I'd like to know how to proceed.

Karl Williamson
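A quick check of the no-break space anomaly Karl mentions above (my illustration; the commented results are as described in this thread for perl 5.10, where \h and \p{} use Unicode rules even for byte-encoded strings but \s does not):

    use strict;
    use warnings;

    my $nbsp = "\xa0";    # NO-BREAK SPACE in a byte-encoded string

    print $nbsp =~ /\h/        ? "\\h matches\n"        : "\\h does not match\n";          # matches
    print $nbsp =~ /\s/        ? "\\s matches\n"        : "\\s does not match\n";          # does not match
    print $nbsp =~ /\p{Space}/ ? "\\p{Space} matches\n" : "\\p{Space} does not match\n";   # matches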
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: Volunteer for fixing [perl #58182], the "Unicode" bug
Date: Sat, 20 Sep 2008 16:31:57 -0700
To: karl williamson <contact [...] khwilliamson.com>
From: Glenn Linderman <perl [...] NevCal.com>
On approximately 9/20/2008 3:52 PM, came the following characters from the keyboard of karl williamson:
> I'm the person who submitted this bug report. I think this bug should be fixed in Perl 5, and I'm volunteering to do it. [...]
>
> So I'd like to know how to proceed
>
> Karl Williamson
I applaud your willingness to dive in.

For compatibility reasons, as has been discussed on this list previously, a pragma of some sort must be used to request the incompatible enhancement (which you call a fix).

N.B. There are lots of discussions about it in the archive, some recent. If you haven't found them, you should; if you find it hard to find them, ask, and I (or someone) will try to find the starting points for you. Perhaps the summaries would be a good place to look to find the discussions; I participated in most of them. Those discussions are lengthy reading, unfortunately, but they do point out an extensive list of issues, perhaps approaching completeness.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: Volunteer for fixing [perl #58182], the "Unicode" bug
Date: Sun, 21 Sep 2008 06:14:56 +0200
To: karl williamson <contact [...] khwilliamson.com>
From: andreas.koenig.7os6VVqR [...] franz.ak.mind.de (Andreas J. Koenig)
>>>>> On Sat, 20 Sep 2008 16:52:02 -0600, karl williamson <contact@khwilliamson.com> said:
> I'm the person who submitted this bug report. I think this bug should be fixed in Perl 5, and I'm volunteering to do it. [...]
Thank you! As for the protocol: do not patch 5.10, patch bleadperl instead.

--
andreas
CC: "Perl5 Porters" <perl5-porters [...] perl.org>
Subject: Re: Volunteer for fixing [perl #58182], the "Unicode" bug
Date: Mon, 22 Sep 2008 15:01:21 +0200
To: "karl williamson" <contact [...] khwilliamson.com>
From: "Rafael Garcia-Suarez" <rgarciasuarez [...] gmail.com>
2008/9/21 karl williamson <contact@khwilliamson.com>:
> The crux of the problem is that on non-EBCDIC machines, in the absence of locale, in order to have meaningful semantics, a character (or code point) has to be stored in utf8, except in pattern matching the \h, \H, \v and \V or any of the \p{} patterns. [...]
>
> There may be others which I haven't looked for yet. I think, for example, that quotemeta() will escape all these characters, though I don't believe that this causes a real problem.
This is a good summary of the issues.

> One response I got to my bug report was that a lot of code depends on things working the way they currently do. I'm wondering if that applies to all three of the areas, or just the first?

In general, one finds that people write code relying on almost anything...

> Also, from reading the perl source, it appears to me that EBCDIC machines may work differently (and more correctly to my way of thinking) than Ascii-ish ones.

That's probable in theory, but we don't have testers on EBCDIC machines these days...

> An idea I've had is to add a pragma like "use latin1", or maybe "use locale unicode", or something else as a way of not breaking existing application code.

I think that the current Unicode bugs are annoying enough to deserve an incompatible change in perl 5.12. However, for perl 5.10.x, something could be added to switch to a more correct behaviour, if possible without slowing everything down...

> Anyway, I'm hoping to get some sort of fix in for this. In my experimental implementation (which currently doesn't change EBCDIC handling), it is mostly just extending the existing definitions of ascii semantics to include the 128..255 latin1 range. Code logic changes were required only in the uc and ucfirst functions (to accommodate 3 characters which require special handling), and in the regular expression compilation (to accommodate 2 characters which need special handling). Obviously, in my ignorance, I may be missing things that others can enlighten me on.
>
> So I'd like to know how to proceed

If you're a git user, you can work on a branch cloned from git://perl5.git.perl.org/perl.git

Do not hesitate to ask questions here.
Subject: Re: [perl #58182] Inconsistent and wrong handling of 8th bit set chars with no locale
Date: Mon, 22 Sep 2008 21:53:52 +0200
To: perl5-porters [...] perl.org
From: Juerd Waalboer <juerd [...] convolution.nl>
Moritz Lenz skribis 2008-08-21 9:50 (+0200):
> This is a known bug, and probably not fixable, because too much code depends on it.

It is fixable, and the backwards incompatibility has already been announced in perl5100delta:

| The handling of Unicode still is unclean in several places, where it's
| dependent on whether a string is internally flagged as UTF-8. This will
| be made more consistent in perl 5.12, but that won't be possible without
| a certain amount of backwards incompatibility.

It will be fixed, and it's wonderful to have a volunteer for that!

--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales@convolution.nl>
1;
Subject: Re: [perl #58182] Inconsistent and wrong handling of 8th bit set chars with no locale
Date: Mon, 22 Sep 2008 21:55:23 +0200
To: perl5-porters [...] perl.org
From: Juerd Waalboer <juerd [...] convolution.nl>
Dr.Ruud skribis 2008-08-21 13:22 (+0200):
> Unicode semantics get activated when a codepoint above 255 is involved.

No, unicode semantics get activated when the internal encoding of the string is utf8, even if it contains no character above 255, and even if it only contains ASCII characters.

It's a bug. A known and old bug, but it must be fixed some time.

--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales@convolution.nl>
1;
Subject: Re: Volunteer for fixing [perl #58182], the "Unicode" bug
Date: Mon, 22 Sep 2008 22:05:56 +0200
To: perl5-porters [...] perl.org
From: Juerd Waalboer <juerd [...] convolution.nl>
karl williamson skribis 2008-09-20 16:52 (-0600):
> One response I got to my bug report was that a lot of code depends on things working the way they currently do. I'm wondering if that applies to all three of the areas, or just the first?

All three, but rest assured that this has already been discussed in great detail, and that the pumpking's decision was that backwards incompatibility would be better than keeping the bug. This decision is clearly reflected in perl5100delta:

| The handling of Unicode still is unclean in several places, where it's
| dependent on whether a string is internally flagged as UTF-8. This will
| be made more consistent in perl 5.12, but that won't be possible without
| a certain amount of backwards incompatibility.

Please proceed with fixing the bug. I am very happy with your offer to smash this one.

> Also, from reading the perl source, it appears to me that EBCDIC machines may work differently (and more correctly to my way of thinking) than Ascii-ish ones.

As always, I refrain from thinking about EBCDIC. I'd say: keep the current behavior for EBCDIC platforms - there haven't been *any* complaints from them as far as I've heard.

> An idea I've had is to add a pragma like "use latin1", or maybe "use locale unicode", or something else as a way of not breaking existing application code.

Please do break existing code, harsh as that may be. It is much more likely that broken code magically starts working correctly, by the way.

Pragmas have problems, especially in regular expressions. And it's very hard to load a pragma conditionally, which makes writing version portable code hard. Besides that, any pragma affecting regex matches needs to be carried in qr//, which in this case means new regex flags to indicate the behavior for (?i:...). According to dmq, adding flags is hard.

> Obviously, in my ignorance, I may be missing things that others can enlighten me on.

Please feel free to copy the unit tests in Unicode::Semantics!

--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales@convolution.nl>
1;
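An aside illustrating the qr// point (my example, not from the thread): a compiled regex carries its modifiers in its stringification, so any new semantics flag would have to travel the same way.

    use strict;
    use warnings;

    my $re = qr/foo/i;
    print "$re\n";   # "(?i-xsm:foo)" on 5.10-era perls; newer perls print something like "(?^i:foo)"

    print "FOO" =~ $re ? "match\n" : "no match\n";             # match -- the /i travels with the qr
    print "xxFOOxx" =~ /bar|$re/ ? "match\n" : "no match\n";   # still case-insensitive when interpolated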
Subject: Re: Volunteer for fixing [perl #58182], the "Unicode" bug
Date: Mon, 22 Sep 2008 22:09:34 +0200
To: perl5-porters [...] perl.org
From: Juerd Waalboer <juerd [...] convolution.nl>
Glenn Linderman skribis 2008-09-20 16:31 (-0700):
> For compatibility reasons, as has been discussed on this list previously, a pragma of some sort must be used to request the incompatible enhancement (which you call a fix).

As the current behavior is a bug, the enhancement can rightfully be called a fix.

What's this about the pragma that "must be used"? Yes, it has been discussed, but no consensus has pointed in that direction. In fact, perl5100delta clearly announces backwards incompatibility.

--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales@convolution.nl>
1;
Subject: Re: [perl #58182] Inconsistent and wrong handling of 8th bit set chars with no locale
Date: Tue, 23 Sep 2008 01:18:08 +0200
To: perl5-porters [...] perl.org
From: "Dr.Ruud" <rvtol+news [...] isolution.nl>
Juerd Waalboer schreef:
> Dr.Ruud:
>> Unicode semantics get activated when a codepoint above 255 is involved.
>
> No, unicode semantics get activated when the internal encoding of the string is utf8, even if it contains no character above 255, and even if it only contains ASCII characters.

Yes, Unicode semantics get activated when a codepoint above 255 is involved. Yes, there are other ways too, like:

    perl -Mstrict -Mwarnings -Mencoding=utf8 -le'
        my $s = chr(65);
        print utf8::is_utf8($s);
    '
    1

--
Affijn, Ruud

"Gewoon is een tijger."
CC: "Perl5 Porters" <perl5-porters [...] perl.org>
Subject: Re: Volunteer for fixing [perl #58182], the "Unicode" bug
Date: Tue, 23 Sep 2008 03:32:12 -0400
To: "karl williamson" <contact [...] khwilliamson.com>
From: "Eric Brine" <ikegami [...] adaelis.com>
On Sat, Sep 20, 2008 at 6:52 PM, karl williamson <contact@khwilliamson.com> wrote:
> There may be others which I haven't looked for yet. I think, for example, that quotemeta() will escape all these characters, though I don't believe that this causes a real problem.

There are inconsistencies with quotemeta (and therefore \Q):

    >perl -wle"utf8::downgrade( $x = chr(130) ); print quotemeta $x"

    >perl -wle"utf8::upgrade( $x = chr(130) ); print quotemeta $x"
    é
CC: perl5-porters [...] perl.org
Subject: Re: [perl #58182] Inconsistent and wrong handling of 8th bit set chars with no locale
Date: Tue, 23 Sep 2008 17:03:35 +0100
To: Juerd Waalboer <juerd [...] convolution.nl>
From: Dave Mitchell <davem [...] iabyn.com>
On Mon, Sep 22, 2008 at 09:55:23PM +0200, Juerd Waalboer wrote:
> It's a bug. A known and old bug, but it must be fixed some time.
Here's a general suggestion related to fixing Unicode-related issues.

A well-known issue is that the SVf_UTF8 flag means two different things: 1) whether the 'sequence of integers' is stored one per byte, or uses the variable-length utf-8 encoding scheme; 2) what semantics apply to that sequence of integers. We also have various bodges, such as attaching magic to cache utf8 indexes. All this stems from the fact that there's no space in an SV to store all the information we want.

So.... How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an Extended String flag. This flag indicates that prepended to the SvPVX string is an auxiliary structure (cf the hv_aux struct) that contains all the extra needed unicodish info, such as encoding, charset, locale, cached indexes etc etc.

This then both allows us to disambiguate the meaning of SVf_UTF8 (in the aux structure there would be two different flags for the two meanings), and would also provide room for future enhancements (eg space for a UTF32 flag should someone wish to implement that storage format).

Just a thought...

--
"I do not resent criticism, even when, for the sake of emphasis, it parts for the time with reality". -- Winston Churchill, House of Commons, 22nd Jan 1941.
Subject: Re: [perl #58182] Inconsistent and wrong handling of 8th bit set chars with no locale
Date: Tue, 23 Sep 2008 18:58:16 +0200
To: perl5-porters [...] perl.org
From: Juerd Waalboer <juerd [...] convolution.nl>
Dave Mitchell skribis 2008-09-23 17:03 (+0100):
> How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an Extended String flag. This flag indicates that prepended to the SvPVX string is an auxilliary structure (cf the hv_aux struct) that contains all the extra needed unicodish info, such as encoding, charset, locale, cached indexes etc etc.

It sounds rather complicated, whereas the current plan would be to continue with the single bit flag, and only remove one of its meanings.

--
Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales@convolution.nl>
1;
CC: perl5-porters [...] perl.org
Subject: Re: [perl #58182] Inconsistent and wrong handling of 8th bit set chars with no locale
Date: Tue, 23 Sep 2008 11:01:11 -0700
To: Juerd Waalboer <juerd [...] convolution.nl>
From: Glenn Linderman <perl [...] NevCal.com>
On approximately 9/23/2008 9:58 AM, came the following characters from the keyboard of Juerd Waalboer:
> Dave Mitchell skribis 2008-09-23 17:03 (+0100):
>> How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an Extended String flag. This flag indicates that prepended to the SvPVX string is an auxilliary structure (cf the hv_aux struct) that contains all the extra needed unicodish info, such as encoding, charset, locale, cached indexes etc etc.

It is not at all clear to me that encoding, charset, and locale are Unicodish info... Unicode frees us from such stuff, except at boundary conditions, where we must deal with devices or formats that have limitations. This extra information seems more appropriately bound to file/device handles than to strings.

Cached indexes are a nice performance help. I don't know enough about the internals to know whether reworking them from being done as magic, to being done in some frightfully (in thinking of XS) new structure, would be an overall win or loss.

> It sounds rather complicated, whereas the current plan would be to continue with the single bit flag, and only remove one of its meanings.

I guess Juerd is referring to removing any semantic meaning of the flag, and leaving it to simply be a representational flag? That representational flag would indicate that the structure of the string is single-byte oriented (no individual characters exceed a numeric value of 255), or multi-byte oriented (characters may exceed a numeric value of 255, and characters greater than a numeric value of 127 will be stored in multiple, sequential bytes).

After such a removal, present-perl would reach the idyllic state (idyllic-perl) of implementing only Unicode semantics for all string operations. (Even the EBCDIC port should reach that idyllic state, although it would use a different encoding of numbers to characters, UTF-EBCDIC instead of UTF-8.)

If other encodings are desired/used, there would be two application approaches to dealing with it:

1) convert all other encodings to Unicode, perform semantic operations as needed, convert the results to some other encoding. This is already the recommended approach, although present-perl's attempt to retain the single-byte oriented representational format as much as possible presently makes this a bit tricky.

2) leave data in other encodings, but avoid the use of Perl operations that apply Unicode semantics in manners that are inconsistent with the semantics of the other encoding. Write specific code to implement the semantics of the other encoding as needed, without doing the re-coding. This could be somewhat error prone, but could be achieved, since, after all, strings are simply an ordered list of numbers, to which any application semantics that are desired can be applied. Idyllic-perl simply provides a fairly large collection of string operations that have Unicode semantics, which are inappropriate for use with strings having other semantics.

Note that binary data in strings is simply a special case of strings with non-Unicode semantics...

In present-perl, there are three sets of string semantics selected by the representation: ASCII (operations like character classes and case shifting), Latin-1 (the only operation that supports Latin-1 semantics is the conversion from single-byte representation to multi-byte representation), and Unicode (operations like character classes and case shifting). It is already inappropriate to apply operations that imply ASCII or Unicode semantics to binary strings of either representation. Applying the representation conversion operation to binary data is perfectly legal, and doesn't change the binary values in any way... but it is generally not a mental shift that most programmers wish to make in dealing with binary data--most prefer their binary data to remain in the single-byte oriented representation, and they are welcome to code in such a manner that they do.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
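A small check of the representation-vs-value distinction Glenn describes (my sketch, not from the thread): upgrading a string changes how it is stored, not which characters it contains.

    use strict;
    use warnings;

    my $bytes = "a\xe0\x7f";      # three characters, stored one byte each
    my $wide  = $bytes;
    utf8::upgrade($wide);         # same three characters, multi-byte storage

    print length($bytes) == length($wide) ? "same length\n"   : "different length\n";
    print $bytes eq $wide                 ? "eq: identical\n" : "eq: different\n";
    print join( ',', map { ord } split //, $wide ), "\n";     # 97,224,127 for both copies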
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: Volunteer for fixing [perl #58182], the "Unicode" bug
Date: Tue, 23 Sep 2008 12:18:12 -0600
From: karl williamson <contact [...] khwilliamson.com>
Glenn,

The reason I called it a bug is that I, an experienced Perl programmer, attempted to enhance an application to understand unicode. I read the Camel book and the on-line documentation and came to a very different expectation of how it worked than it does in reality. I then thought I was scouring the documentation when things went wrong, and still didn't get it. It was only after a lot of experimentation and some internet searches that I started to see the cause of the problem. I was using 5.8.8; perhaps the documentation has changed in 5.10. And perhaps my own expectations of how I thought it should work caused me to be blind to things in the documentation that were contrary to my preconceived notions.

Whatever one calls it, there does seem to be some support for changing the behavior. After reading your response and further reflection, I think that Goal #1 of not breaking old programs is contradictory to the other ones. Indeed, a few regression tests fail with my experimental implementation. Some of them are commented to say that they are taking advantage of the anomaly to verify that the operation they test doesn't change the utf8-ness of the data. Others explicitly test that, for example, taking lc(E with an accent) returns itself unless an appropriate locale is specified. I doubt that the code that test was for really cares, but if so, why put in the test? There are a couple of failures which are obtuse, and uncommented, so I haven't yet tried to figure out what was going on. I wanted to see if I should proceed at all before doing so.

I have looked in the archive and found some discussions about this problem, but certainly not a lot. Please let me know of ones you think important that I read.

Karl Williamson

Glenn Linderman wrote:
> For compatibility reasons, as has been discussed on this list previously, a pragma of some sort must be used to request the incompatible enhancement (which you call a fix).
>
> N.B. There are lots of discussions about it in the archive, some recently, if you haven't found them, you should; if you find it hard to find them, ask, and I (or someone) will try to find the starting points for you, perhaps the summaries would be a good place to look to find the discussions; I participated in most of them.
>
> Those discussions are lengthy reading, unfortunately, but they do point out an extensive list of issues, perhaps approaching completeness.
Subject: Re: Volunteer for fixing [perl #58182], the "Unicode" bug
Date: Tue, 23 Sep 2008 15:01:50 -0700
To: karl williamson <contact [...] khwilliamson.com>, perl5 porters <perl5-porters [...] perl.org>
From: Glenn Linderman <perl [...] NevCal.com>
On approximately 9/23/2008 10:33 AM, came the following characters from the keyboard of karl williamson:
> The reason I called it a bug is that I, an experienced Perl programmer, attempted to enhance an application to understand unicode. I read the Camel book and the on-line documentation and came to a very different expectation as to how it worked than it does in reality.

The behavior is non-obvious. I may be blind to the deficiencies of the documentation, because of knowing roughly how it works, due to hanging out on p5p too long :)

It has been an open discussion whether it is working as designed (with lots of gotchas for the application programmer), or whether the design, in fact, is the bug. It seems Rafael has declared it to be a bug in the 5.10 release notes, and something that can/should be incompatibly changed/fixed for 5.12, but I missed that declaration. Any solution for 5.8.x or 5.10.x, though, would have to be treated as an enhancement, turned on by a pragma, because the current design, buggy or not, is the current design for which applications are coded.

> I then thought I was scouring the documentation when things went wrong, and still didn't get it. It was only after a lot of experimentation and some internet searches that I started to see the cause of the problem. I was using 5.8.8; perhaps the documentation has changed in 5.10. And perhaps my own expectations of how I thought it should work caused me to be blind to things in the documentation that were contrary to my preconceived notions.

The documentation has been in as much flux as the code, from 5.6.x to 5.8.x to 5.10.x. Unfortunately, there are enough warts in the design that it is hard to find all the places where the documentation should be clarified. My most recent message to p5p clarifies what I think is the idyllic state that I hope is the one that you share, and will achieve for the 5.12 (or for a pragma-enabled 5.10) redesign/bug-fix.

> Whatever one calls it, there does seem to be some support for changing the behavior. After reading your response and further reflection, I think that Goal #1 of not breaking old programs is contradictory to the other ones.

Yes, there is a definite conflict between those goals, and from that conflict arise many of the behaviours that are not expected by reasonable programmers when designing their application code.

> Indeed, a few regression tests fail with my experimental implementation. Some of them are commented that they are taking advantage of the anomaly to verify that the operation they test doesn't change the utf8-ness of the data. Others explicitly are testing that, for example, taking lc(E with an accent) returns itself unless an appropriate locale is specified. I doubt that the code that test was for really cares, but if so, why put in the test? There are a couple of failures which are obtuse, and uncommented, so I haven't yet tried to figure out what was going on. I wanted to see if I should proceed at all before doing so.

Sure. Please proceed. Especially with Rafael's openness to incompatible changes in this area for 5.12, it would be possible to remove all of the warts, conflicts, and unexpected behaviours. Of course, incompatible changes are always considered for major releases, but not always accepted. But this area seems to have a green light.

The current situation is very painful, compared to other languages that implement Unicode. The compatibility issue was very real, however, when the original design was done, no doubt partly due to Perl's extensive CPAN collection, particularly the XS part of that CPAN collection. Some of that concern has been alleviated due to enhancements to the XS code in the intervening years, although no doubt you may encounter bugs in some of those enhancements, also.

> I have looked in the archive and found some discussions about this problem, but certainly not a lot. Please let me know of ones you think important that I read.

The discussions are more lengthy (per post, and per number of posts) than numerous (by thread count)... and often contain more heat than light. Perhaps you've found them all.

Given Rafael's green light, and if you are pointed at changes for Perl 5.12, the most important thing is to cover all the relevant operations, so that all string operations apply Unicode semantics to all their operands, regardless of their representational format.

Here is one thread: "on the almost impossibility to write correct XS modules", started by Marc Lehmann on April 25, 2008, which lasted with that subject line until at least May 22 -- so almost a whole month!

demerphq spawned a related thread, "On the problem of strings and binary data in Perl.", on May 20, 2008. This attempted to deal with multi-lingual strings; there is more to the issue of proper handling of multi-lingual strings than being able to represent all the characters that each one uses, but that is a very specialized type of program. At least being able to represent the characters is a good start; being able to pass "language" as an operand to certain semantic operations would be good (implicitly, via locale, or explicitly, via a parameter).

Another related issue is that various operations that attempt to implement Unicode semantics don't go the whole way, and have interesting semantics for when strings (even strings represented in multi-byte format) don't actually contain Unicode. Idyllic-perl should have chr/ord as simple ways to convert between numbers and characters, and not burden them with any sort of Unicode semantics. See bug #51936, and the p5p thread it spawned (search for the bug number in the archives). See also bug #51710 and the threads it spawned, about utf8_valid. While utf8_valid probably should be enhanced, its existence is probably reasonable justification not to burden chr/ord with Unicode validity checks.

Let's not forget pack & unpack. There's one thread about that with the subject "Perl 5.8 and perl 5.10 differences on UTF/Pack things", started on June 18, 2008... and a much older one started by Marc Lehmann (not sure what that subject line was, but it resulted in a fairly recent change to pack, by Marc).

Other related threads have the following subject lines:
    use encoding 'utf8' bug for Latin-1 range
    proposed change in utf8 filename semantics
    Compress::Zlib, pack "C" and utf-8
    Smack! (this spawned some other threads that left Smack! in their subjects, but which were added to)
    perl, the data, and the tf8 flag
    the utf8 flag
    encoding neutral unpack

The philosophy should be that no Perl operations should have different semantics based on the representation of the string being single-byte or multi-byte format. Operations and places to watch out for include (one of these threads attempted a complete list of operations that had different semantics; this is my memory of some of them):

    string constant metacharacters such as \u \U \l \L
    case shifting code such as uc & lc
    regexp case insensitivity and character classes
    chr/ord
    utf8::is_valid
    pack/unpack - packing should always produce a single-byte string, and unpack should generally expect a single-byte string... but if, for some reason, unpack is handed a multi-byte string, it should not pretend it really should have been a single-byte string; instead, it should interpret the string as input characters. If there are any input characters actually greater than 255, this should probably be considered a bug, because pack doesn't produce such.

Perhaps Marc's fix was the last issue along that line for unpack...

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
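A compact probe over a couple of the operations in that list (my sketch; the commented results reflect the buggy behavior described in this thread, where only the UTF-8-encoded copy gets Unicode semantics):

    use strict;
    use warnings;

    my $byte = "\xe0";              # LATIN SMALL LETTER A WITH GRAVE, byte-encoded
    my $wide = "\xe0";
    utf8::upgrade($wide);           # same character, multi-byte representation

    for my $pair ( [ byte => $byte ], [ wide => $wide ] ) {
        my ( $label, $s ) = @$pair;
        printf '%s: \U gives U+%04X, matches \w: %d' . "\n",
            $label,
            ord("\U$s"),                 # \U is uc() applied inside a string literal
            ( $s =~ /\w/ ? 1 : 0 );      # word-character class
    }
    # byte: \U gives U+00E0, matches \w: 0
    # wide: \U gives U+00C0, matches \w: 1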
CC: "Juerd Waalboer" <juerd [...] convolution.nl>, perl5-porters [...] perl.org
Subject: Re: [perl #58182] Inconsistent and wrong handling of 8th bit set chars with no locale
Date: Fri, 26 Sep 2008 12:11:13 +0200
To: "Dave Mitchell" <davem [...] iabyn.com>
From: "Rafael Garcia-Suarez" <rgarciasuarez [...] gmail.com>
2008/9/23 Dave Mitchell <davem@iabyn.com>:
> How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an Extended String flag. This flag indicates that prepended to the SvPVX string is an auxilliary structure (cf the hv_aux struct) that contains all the extra needed unicodish info, such as encoding, charset, locale, cached indexes etc etc.

I don't think we want to store the charset/locale with the string. Consider the string "istanbul". If you're treating this string as English, you'll capitalize it as "ISTANBUL", but if you want to follow the Stambouliot spelling, it's "İSTANBUL". Now consider the string "Consider the string "istanbul"". Shall we capitalize it as "CONSİDER THE STRİNG "İSTANBUL"" ?

Obviously attaching a language to a string is going to be a problem when you have to handle multi-language strings. So the place that makes sense to provide this information is, in my opinion, in lc and uc (and derivatives): in the code, not the data. (So a pragma can be used, too.)
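A toy illustration of Rafael's point (uc_in_language below is a hypothetical helper, not a real API): the language has to be an argument of the casing operation, because a single string cannot carry one language tag for all of its parts.

    use strict;
    use warnings;
    use utf8;                                # this source file is UTF-8-encoded
    binmode STDOUT, ':encoding(UTF-8)';

    # Hypothetical: apply language-specific tailoring before the default mapping.
    sub uc_in_language {
        my ( $string, $lang ) = @_;
        if ( $lang eq 'tr' ) {
            $string =~ tr/iı/İI/;            # Turkish dotted/dotless i pairs
        }
        return uc $string;
    }

    print uc_in_language( 'istanbul', 'en' ), "\n";   # ISTANBUL
    print uc_in_language( 'istanbul', 'tr' ), "\n";   # İSTANBUL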
Subject: Re: [perl #58182] Unicode problem
Date: Fri, 26 Sep 2008 12:44:31 -0600
To: Perl5 Porters <perl5-porters [...] perl.org>
From: karl williamson <contact [...] khwilliamson.com>
I have been studying some of the discussions in this group about this problem, and find them overwhelming. So, I'm going to just put forth a simple straw proposal that doesn't address a number of the things that people were talking about, but does solve a lot of things. This is a very concrete proposal, and I would like to get agreement on the semantics involved:

There will be a new mode of operation which will be enabled or disabled by means yet to be decided. When enabled, the new behavior will be that a character in a scalar or pattern will have the same semantics whether or not it is stored as utf8. The operations that are affected are lc(), lcfirst(), uc(), ucfirst(), quotemeta() and pattern matching (including \U, \u, \L, and \l, and matching of things like \w, [[:punct:]]). This is effectively what would happen if we were operating under an iso-8859-1 locale, with the following modifications to get full unicode semantics:

1) uc(LATIN SMALL LETTER Y WITH DIAERESIS) will be LATIN CAPITAL LETTER Y WITH DIAERESIS. Same for ucfirst, and \U and \u in pattern substitutions. The result will be in utf8, since the capital letter is above 0xff.

2) uc(MICRO SIGN) will be GREEK CAPITAL LETTER MU. Same for ucfirst, and \U and \u in pattern substitutions. The result will be in utf8, since the capital letter is above 0xff.

3) uc(LATIN SMALL LETTER SHARP S) will be a string consisting of LATIN CAPITAL LETTER S followed by itself; ie, 'SS'. Same for \U in pattern substitutions. The result will have the same utf8-ness as the original.

4) ucfirst(LATIN SMALL LETTER SHARP S) will be a string consisting of the two characters LATIN CAPITAL LETTER S followed by LATIN SMALL LETTER S; ie, 'Ss'. Same for \u in pattern substitutions. The result will have the same utf8-ness as the original.

5) If the MICRO SIGN is in a pattern with case ignored, it will match itself and both GREEK CAPITAL LETTER MU and GREEK SMALL LETTER MU.

6) If the LATIN SMALL LETTER SHARP S is in a pattern with case ignored, it will match itself and any of 'SS', 'Ss', 'ss'.

7) If the LATIN SMALL LETTER Y WITH DIAERESIS is in a pattern with case ignored, it will match itself and LATIN CAPITAL LETTER Y WITH DIAERESIS.

This mode would not impose a compile-time latin1-like locale on the perl program. For example, whether perl identifiers could have a LATIN SMALL LETTER Y WITH DIAERESIS in them or not would not be affected by this mode.

I do not propose to automatically convert ("downgrade") strings from utf8 to latin1 when utf8 is not needed. For example, lc(LATIN CAPITAL LETTER Y WITH DIAERESIS) would return a string still in utf8 encoding.

I don't know what to do about EBCDIC machines. I propose leaving them to work the way they currently do.

I don't know what to do about interacting with "use bytes". One option is for them to be mutually incompatible, that is, if you turn one on, it turns the other off. Another option is, if both are in effect, for it to be exactly the same as if a latin1 run-time locale was set, without any of the modifications listed above. Are there other interactions that we need to worry about?

I would like to defer how this mode gets enabled or disabled until we agree on the semantics of what happens when it is enabled. I think that a number of the issues that have been raised in the past are in some way independent of this proposal. We may want to do some of them, but should we do at least this much, or not?
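The proposed semantics, restated as checks (my sketch, not Karl's; these describe the target behavior and are not expected to pass on an unpatched perl 5.10 for byte-encoded strings):

    use strict;
    use warnings;
    use Test::More tests => 7;

    is uc("\x{ff}"), "\x{178}", 'uc(y with diaeresis) is Y with diaeresis';        # item 1
    is uc("\x{b5}"), "\x{39c}", 'uc(MICRO SIGN) is GREEK CAPITAL MU';              # item 2
    is uc("\x{df}"),      'SS', 'uc(sharp s) is "SS"';                             # item 3
    is ucfirst("\x{df}"), 'Ss', 'ucfirst(sharp s) is "Ss"';                        # item 4
    ok "\x{39c}" =~ /\x{b5}/i,   'MICRO SIGN matches capital MU under /i';         # item 5
    ok 'SS'      =~ /^\x{df}$/i, 'sharp s matches "SS" under /i';                  # item 6
    ok "\x{178}" =~ /\x{ff}/i,   'y with diaeresis matches its capital under /i';  # item 7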
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #58182] Unicode problem
Date: Sat, 27 Sep 2008 02:45:26 +0400
To: karl williamson <contact [...] khwilliamson.com>
From: Vadim Konovalov <vadim [...] vkonovalov.ru>
karl williamson wrote: Show quoted text
> I have been studying some of the discussions in this group about this > problem, and find them overwhelming. So, I'm going to just put forth > a simple straw proposal that doesn't address a number of the things > that people were talking about, but does solve a lot of things. > > This is a very concrete proposal, and I would like to get agreement on > the semantics involved: > There will be a new mode of operation which will be enabled or > disabled by means yet to be decided. When enabled, the new behavior > will be that a character in a scalar or pattern will have the same > semantics whether or not it is stored as utf8. The operations that > are affected are lc(), lcfirst(), uc(), ucfirst(), quotemeta() and > patten matching, (including \U, \u, \L, and \l, and matching of > things like \w, [[:punct:]]). This is effectively what would happen > if we were operating under an iso-8859-1 locale
what the "under an iso-8859-1 locale" exactly? reading perllocale gives me: USING LOCALES The use locale pragma By default, Perl ignores the current locale. The "use locale" pragma tells Perl to use the current locale for some operations: Do I understand correctly that your proposal will never touch me provided that I never do "use locale;"? You do not mean posix locale, don't you? Do I remember correctly that using locales is not recommended in Perl? Show quoted text
> with the following > modifications to get full unicode semantics: > 1) uc(LATIN SMALL LETTER Y WITH DIAERESIS) will be LATIN CAPITAL > LETTER Y WITH DIAERESIS. Same for ucfirst, and \U and \u in > pattern substitutions. The result will be in utf8, since the > capital letter is above 0xff.
could you please be more precise with uc(blablabal)? what you currently wrote is a syntax error Show quoted text
> .... >
Best regards, Vadim.
Subject: Re: [perl #58182] Unicode problem
Date: Fri, 26 Sep 2008 13:32:49 -0600
To: Perl5 Porters <perl5-porters [...] perl.org>
From: karl williamson <contact [...] khwilliamson.com>
What I meant is not a literal locale, but that the semantics would be the same as for iso-8859-1 characters, with the listed modifications. I was trying to avoid listing all the 8859-1 semantics. But in brief, there are 128 characters above ASCII in 8859-1, and they each have semantics. 0xC0, for example, is a latin capital letter A with a grave accent; its lower case is 0xE0. If you are on a Un*x-like system, you can type 'man latin1' at a command line prompt to get the entire list. It doesn't, however, say which things are punctuation, which are word characters, etc. But they are the same in Unicode, so the Unicode standard lists all of them. Characters listed in the man page that are marked as capital all have corresponding lower case versions that are easy to figure out by their names. The three characters I mentioned as modifications to get Unicode are considered lower case and have either multiple-character upper case versions, or their upper case version is not in latin1. (A short sketch of these case pairs follows the quoted text at the end of this message.)

My proposal would touch you UNLESS you do have a 'use locale'. Your locale would override my proposal. In other words, by specifying "use locale", my proposal would not touch your program. The documentation does say not to use locales, but in looking at the code, it appears to me that a locale takes precedence, and does work ok. I believe that you can get many of the Perl glitches to go away by having a locale which specifies iso-8859-1. But I haven't actually tried it.

Vadim Konovalov wrote: Show quoted text
> karl williamson wrote:
>> I have been studying some of the discussions in this group about this >> problem, and find them overwhelming. So, I'm going to just put forth >> a simple straw proposal that doesn't address a number of the things >> that people were talking about, but does solve a lot of things. >> >> This is a very concrete proposal, and I would like to get agreement on >> the semantics involved: >> There will be a new mode of operation which will be enabled or >> disabled by means yet to be decided. When enabled, the new behavior >> will be that a character in a scalar or pattern will have the same >> semantics whether or not it is stored as utf8. The operations that >> are affected are lc(), lcfirst(), uc(), ucfirst(), quotemeta() and >> patten matching, (including \U, \u, \L, and \l, and matching of >> things like \w, [[:punct:]]). This is effectively what would happen >> if we were operating under an iso-8859-1 locale
> > what the "under an iso-8859-1 locale" exactly? > > reading perllocale gives me: > > USING LOCALES > The use locale pragma > > By default, Perl ignores the current locale. The "use locale" > pragma tells Perl to > use the current locale for some operations: > > Do I understand correctly that your proposal will never touch me > provided that I never do "use locale;"? > You do not mean posix locale, don't you? > > Do I remember correctly that using locales is not recommended in Perl? >
>> with the following >> modifications to get full unicode semantics: >> 1) uc(LATIN SMALL LETTER Y WITH DIAERESIS) will be LATIN CAPITAL >> LETTER Y WITH DIAERESIS. Same for ucfirst, and \U and \u in >> pattern substitutions. The result will be in utf8, since the >> capital letter is above 0xff.
> > could you please be more precise with uc(blablabal)? > > what you currently wrote is a syntax error >
>> .... >>
> > Best regards, > Vadim. > >
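As a rough sketch of the Latin-1 case pairs referred to above (assuming plain ISO-8859-1; the multiplication and division signs are skipped because they are not letters):

    # The ISO-8859-1 uppercase letters are 0xC0..0xDE except 0xD7
    # (MULTIPLICATION SIGN); each one's lowercase is 0x20 higher.
    # 0xB5 (micro sign), 0xDF (sharp s) and 0xFF (y with diaeresis) are
    # the ones needing the modifications listed in the proposal.
    my %latin1_lc;
    for my $upper (0xC0 .. 0xDE) {
        next if $upper == 0xD7;
        $latin1_lc{ chr $upper } = chr( $upper + 0x20 );
    }
    # e.g. $latin1_lc{"\xC0"} eq "\xE0"   # A with grave => a with grave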
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #58182] Unicode problem
Date: Fri, 26 Sep 2008 14:00:16 -0700
To: karl williamson <contact [...] khwilliamson.com>
From: Glenn Linderman <perl [...] NevCal.com>
On approximately 9/26/2008 11:44 AM, came the following characters from the keyboard of karl williamson: Show quoted text
> I have been studying some of the discussions in this group about this > problem, and find them overwhelming. So, I'm going to just put forth a > simple straw proposal that doesn't address a number of the things that > people were talking about, but does solve a lot of things.
Yeah, I gave you a lot of reading material. I hoped not to scare you off, but I didn't want you to be ignorant of the previous discussions, do a bunch of work that only solved part of the problems, and have it rejected because it wasn't a complete solution. Show quoted text
> This is a very concrete proposal, and I would like to get agreement on > the semantics involved: > There will be a new mode of operation which will be enabled or > disabled by means yet to be decided.
This makes it sound like you are targeting 5.10.x, since you are talking about modes of operation. On the other hand, if the implementation isn't significantly more complex than the current code, keeping the current behavior might be a safe approach, even if somewhere, somehow, the new behavior decides to become the default behavior. Show quoted text
> When enabled, the new behavior > will be that a character in a scalar or pattern will have the same > semantics whether or not it is stored as utf8. The operations that > are affected are lc(), lcfirst(), uc(), ucfirst(), quotemeta() and > patten matching, (including \U, \u, \L, and \l, and matching of > things like \w, [[:punct:]]).
This sounds like it might be a complete list of operations. I think \u, \U, \l, and \L are string interpolation operators rather than pattern matching operators, but that is just terminology. Show quoted text
> This is effectively what would happen > if we were operating under an iso-8859-1 locale with the following > modifications to get full unicode semantics: > 1) uc(LATIN SMALL LETTER Y WITH DIAERESIS) will be LATIN CAPITAL > LETTER Y WITH DIAERESIS. Same for ucfirst, and \U and \u in > pattern substitutions. The result will be in utf8, since the > capital letter is above 0xff. > 2) uc(MICRO SIGN) will be GREEK CAPITAL LETTER MU. Same for > ucfirst, and \U and \u in pattern substitutions. The result > will be in utf8, since the capital letter is above 0xff. > 3) uc(LATIN SMALL LETTER SHARP S) will be a string consisting of > LATIN CAPITAL LETTER S followed by itself; ie, 'SS'. Same for > \U in pattern substitutions. The result will have the utf8-ness > as the original. > 4) ucfirst(LATIN SMALL LETTER SHARP S) will be a string consisting > of the two characters LATIN CAPITAL LETTER S followed by LATIN > SMALL LETTER S; ie, 'Ss'. Same for \u in pattern substitutions. > The result will have the utf8-ness as the original. > 5) If the MICRO SIGN is in a pattern with case ignored, it will > match itself and both GREEK CAPITAL LETTER MU and GREEK SMALL > LETTER MU. > 6) If the LATIN SMALL LETTER SHARP S is in a pattern with case > ignored, it will match itself and any of 'SS', 'Ss', 'ss'. > 7) If the LATIN SMALL LETTER Y WITH DIAERESIS is in a pattern with > case ignored, it will match itself and LATIN CAPITAL LETTER Y > WITH DIAERESIS > > This mode would not impose a compile-time latin1-like locale on > the perl program. For example, whether perl identifiers could > have a LATIN SMALL LETTER Y WITH DIAERESIS in them or not would > not be affected by this mode
These all sound like appropriate behaviors to implement for a Unicode semantics mode. However, I wouldn't know (or particularly care) if it is a complete list of the differences between Latin-1 and Unicode semantics. I'm not at all interested in Latin-1 semantics. Today, the operators you list all have ASCII semantics; most everyone seems to agree that Unicode semantics would be preferred. Latin-1 semantics are only used in upgrade/downgrade operations, at present. (Unless someone says use locale; which, as you say, is not recommended.) Show quoted text
> I do not propose to automatically convert ("downgrade") strings from > utf8 to latin1 when utf8 is not needed. For example, lc(LATIN CAPITAL > LETTER Y WITH DIAERESIS) would return a string still in utf8-encoding
Fine. All else being equal (utf8 just being a representation) it shouldn't make any difference. Show quoted text
> I don't know what to do about EBCDIC machines. I propose leaving > them to work the way they currently do.
Best effort non-breakage seems to be the best we can currently expect... Show quoted text
> I don't know what to do about interacting with "use bytes". One > option is for them to be mutually incompatible, that is, if you > turn one on, it turns the other off. Another option is if both > are in effect that it would be exactly the same as if a latin1 > run-time locale was set, without any of the modifications listed > above.
Another possibility would be that all the above listed operations would be no-ops or produce errors, because they all imply Unicode character semantics, whereas use bytes declares that the data is binary. "\U\x45\x23\x37" should just be "\x45\x23\x37", for example, as a no-op. Show quoted text
> Are there other interactions that we need to worry about?
Probably. Every XS writer under the sun has assumed different things about utf8 flag semantics, I'm sure. So you should worry about handling the flak. Show quoted text
> I would like to defer how this mode gets enabled or disabled until we > agree on the semantics of what happens when it is enabled.
Sure, but if you target 5.10.x you need some way of enabling or disabling. If you target 5.12, enabling may happen because it is 5.12. Show quoted text
> I think that a number of the issues that have been raised in the past > are in some way independent of this proposal. We may want to do some of > them, but should we do at least this much, or not?
It might be nice to recap anything that isn't being addressed, at least in general terms, so that someone doesn't "remember" it at the last minute, and claim that your proposal is worthless without a solution in that area.

Unicode filename handling, especially on Windows, might be a contentious point, as it is also basically broken. In fact, once Perl has Unicode semantics for all strings, then it would be basically appropriate for the Windows port to start using the "wide" UTF-16 APIs, instead of the "byte" APIs, for all OS API calls. This might be a fair-size bullet to chew on, but it would be extremely useful; today, it is extremely difficult to write multilingual programs using Perl on Windows, and the biggest culprit is the use of the 8-bit APIs, with the _UNICODE (I think) define not being turned on when compiling perl and extensions. Enough that I have had to learn Python for a recent project.

In large part, one could claim that this is a Windows port issue, not a core perl issue, of course... there is no reason that the Windows port couldn't have already started using wide APIs, even with the limited Unicode support in perl proper... everyone (here at least) knows the kludges to use to get perl proper to use Unicode consistently enough to get work done, but the I/O boundary on Windows is a real problem.

You'll need to give this proposal a week or so of discussion time before you can be sure that everyone that cares has commented, or longer (perhaps much longer) if there is dissension. However, I think a lot of the dissension has been beaten out in earlier discussions, so perhaps the time is ripe that a fresh voice with motivation to make some fixes can actually make progress on this topic.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Subject: Re: [perl #58182] Unicode problem
Date: Sat, 27 Sep 2008 00:29:54 +0200
To: perl5-porters [...] perl.org
From: Juerd Waalboer <juerd [...] convolution.nl>
Hello Karl, I strongly agree with your proposed solutions. (I'm ambivalent only about the 4th: ucfirst "ß".) Thank you for the summary. karl williamson skribis 2008-09-26 12:44 (-0600): Show quoted text
> 1) uc(LATIN SMALL LETTER Y WITH DIAERESIS) will be LATIN CAPITAL > LETTER Y WITH DIAERESIS. Same for ucfirst, and \U and \u in > pattern substitutions. The result will be in utf8, since the > capital letter is above 0xff.
"in utf8" is ambiguous. It can mean either length(uc($y_umlaut)) == 2 or is_utf8(uc($y_umlaut)). The former would be wrong, the latter would be correct. May I suggest including the words "upgrade" and "internal"? The resulting string will be upgraded to utf8 internally, ... Show quoted text
> I don't know what to do about interacting with "use bytes". One > option is for them to be mutually incompatible, that is, if you > turn one on, it turns the other off. > (...) > I would like to defer how this mode gets enabled or disabled until we > agree on the semantics of what happens when it is enabled.
Turning your solutions on explicitly is probably wrong, at least for 5.12. Using a pragma is problematic because of qr//, and because it cannot be enabled conditionally (in any reasonably easy way). I'd prefer to skip any discussion about how to enable or disable this - enable it by default and don't provide any way to disable it. -- Met vriendelijke groet, Kind regards, Korajn salutojn, Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig> Convolution: ICT solutions and consultancy <sales@convolution.nl> 1;
Subject: Re: [perl #58182] Unicode problem
Date: Sat, 27 Sep 2008 12:45:24 +0200
To: perl5-porters [...] perl.org
From: "Dr.Ruud" <rvtol+news [...] isolution.nl>
karl williamson schreef: Show quoted text
> I would like to defer how this mode gets enabled or disabled > until we agree on the semantics of what happens when it is > enabled.
use kurila; # ;-) -- Affijn, Ruud "Gewoon is een tijger."
CC: perl5-porters [...] perl.org
Subject: Re: [perl #58182] Unicode problem
Date: Sun, 28 Sep 2008 00:49:45 +0400
To: "Dr.Ruud" <rvtol+news [...] isolution.nl>
From: Vadim Konovalov <vadim [...] vkonovalov.ru>
Dr.Ruud wrote: Show quoted text
> karl williamson schreef: >
>> I would like to defer how this mode gets enabled or disabled >> until we agree on the semantics of what happens when it is >> enabled. >>
> > use kurila; # ;-) > >
kurila is so largely incompatible that it is effectively off-topic! (Initially I thought it was on-topic, but responders convinced me it isn't, and looking at the direction it is going, it really is not on-topic on p5p.) BR, Vadim.
Subject: Re: [perl #58182] Unicode problem
Date: Mon, 06 Oct 2008 21:02:36 -0600
To: perl5-porters [...] perl.org
From: karl williamson <public [...] khwilliamson.com>
My proposal from a week and a half ago hasn't spawned much dissension--yet. I'll take that as a good sign, and proceed.

Here's a hodge-podge of my thoughts about it, but most important, I am concerned about the enabling and disabling of this. I think there has to be some way to disable it in case current code has come to rely on what I call broken behavior.

It looks like in 5.12, Rafael wants the new mode to be the default behavior. But he also said that a switch could be added in 5.10.x to turn it on, as long as performance doesn't suffer.

Glenn, "use bytes" doesn't necessarily mean binary. For example,

    use bytes;
    print lc('A'), "\n";

prints 'a'. It does mean ASCII semantics even for utf8::upgraded strings.

If there is a way to en/dis-able this mode, doesn't that have to be a pragma? Doesn't it have to be lexically scoped? And if the answers to these are yes, what do we do with things that are created under one mode and then executed in the other?

Juerd wrote:
====
Pragmas have problems, especially in regular expressions. And it's very hard to load a pragma conditionally, which makes writing version portable code hard. Besides that, any pragma affecting regex matches needs to be carried in qr//, which in this case means new regex flags to indicate the behavior for (?i:...). According to dmq, adding flags is hard.
====

I don't understand what you mean that pragmas have problems, especially in re's. Please explain.

I had thought I had this solved for qr//i. The way I was planning to implement this for pattern matching is quite simple. First, by changing the existing fold table definitions to include the Unicode semantics, the pattern matching magically starts working without any code logic changes for all but two characters: the German sharp s and the micro sign. For these, I was planning to use the existing mechanisms to compile the re as utf8, so it wouldn't require any new flags. Thus qr// would be utf8 if it contained these two characters. And it works today to match such a pattern against both non-utf8 and utf8 strings. I haven't tested to see what happens when such a pattern is executed under use bytes. I was presuming it did something reasonable. But now I'm not so sure, as I've found a number of bugs in the re code in my testing, and some are of a nature that I don't feel comfortable, with my level of knowledge about how it works, diving in and fixing them. They should be fixed anyway, and I'm hoping some expert will undertake that. I think that once they're fixed, I could extend them to work in the latin1 range quite easily. So the bottom line is that qr//i may or may not be a problem.

For the other interactions, I'm not sure there is a problem. If one creates a string, whether or not this mechanism is on, it remains 8 bits unless it has a code point above 255. If one operates on it while this mechanism is on, it gets Unicode semantics, which in a few cases irretrievably converts it to utf8 because the result is above 255. If one operates on it while this mechanism is off, you get ASCII semantics. I don't really see a problem with that.

I think it would be easy to extend this to EBCDIC, at least the three encodings perl has compiled-in tables for. The problem is that Rafael said that there's no one testing on EBCDIC machines, so I couldn't know if it worked or not before releasing it.

I'm also thinking that the Windows file name problems can be considered independent of this, and addressed at a later time.

I also agree with Glenn's and Juerd's wording changes.
I saw nothing in my reading of the code that would lead me to touch the utf8 flag's meaning. But I am finding weird bugs in which Perl apparently gets mixed up about the flag. These vanish if I rearrange the order of supposedly independent lines in the program. It looks like it could be a wild write. I wrote a bug report [perl #59378], but I think that the description of that is wrong.

So the bottom line for now is that I'd like to get some consensus about how to turn it on and off (and whether to at all; I think the answer is that there has to be a way to turn it off). I guess I would claim that in 5.12, "use bytes" could be used to turn it off. But that may be controversial, and doesn't address backporting it.
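A small runnable sketch of the "use bytes" behaviour described above, with an upgraded Latin-1 string added for contrast:

    {
        use bytes;
        print lc('A'), "\n";         # prints 'a': ASCII case mapping still applies

        my $s = "\xc0";              # LATIN CAPITAL LETTER A WITH GRAVE
        utf8::upgrade($s);           # now stored internally as two bytes
        print length($s), "\n";      # prints 2: use bytes sees the internal encoding
        print lc($s) eq $s ? "unchanged\n" : "changed\n";
                                     # expected "unchanged": no Latin-1/Unicode
                                     # case mapping is applied under use bytes
    }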
CC: perl5-porters [...] perl.org
Subject: Re: [perl #58182] Unicode problem
Date: Mon, 06 Oct 2008 21:04:47 -0700
To: karl williamson <public [...] khwilliamson.com>
From: Glenn Linderman <perl [...] NevCal.com>
On approximately 10/6/2008 8:02 PM, came the following characters from the keyboard of karl williamson: Show quoted text
> My proposal from a week and a half ago hasn't spawned much > dissension--yet. I'll take that as a good sign, and proceed. > > Here's a hodge-podge of my thoughts about it, but most important, I am > concerned about the enabling and disabling of this. I think there has > to be some way to disable it in case current code has come to rely on > what I call broken behavior. > > It looks like in 5.12, Rafael wants the new mode to be default behavior. > But he also said that a switch could be added in 5.10.x to turn it on, > as long as performance doesn't suffer. > > Glenn, "use bytes" doesn't mean necessarily binary. For example, > use bytes; > print lc('A'), "\n"; > > prints 'a'. It does mean ASCII semantics even for utf8::upgraded strings.
That interpretation could work; however, it is in conflict with the documented behavior of use bytes... use bytes is explicitly documented to work on the bytes of utf8 strings (thus making visible the individual bytes of the UTF8 encoding). While, as you demonstrate, lc('A') is applied, that seems like a bug to me; the documentation says "The use bytes pragma disables character semantics". On the other hand, this documentation may simply be confusing -- it may actually mean only that utf8 strings are to be treated as bytes, like other byte strings, which may have binary or character semantics applied, depending on the operator invoked.

I think it would be much more useful to prohibit operations that apply character semantics while in "use bytes" mode. chr should restrict input values to 0..255, and ord will only produce such. It is already documented that substr, index, and rindex work as byte operators. Regexps compiled while use bytes is in effect should not support character sorts of operations. \w is meaningless on binary data, for example, although character classes (which could be called byte classes) could still be useful without character semantics. I waffle on the regexp operations... I doubt you'll find strong support for this position, due to compatibility reasons, but if we are going incompatible for Unicode support, it seems that going incompatible on bytes support isn't much harder, and could help find bugs. Of course, if you can turn it off, and regain compatibility...
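For reference, a small illustration of how "use bytes" exposes the internal encoding today, which is the behaviour this paragraph debates restricting (the byte values shown are the utf8 encoding of U+0100):

    my $c = "\x{100}";          # one character, stored as the bytes 0xC4 0x80
    {
        use bytes;
        print ord($c), "\n";    # 196: the first byte of the encoding, not 256
        print length($c), "\n"; # 2:   the two bytes of the encoding
    }
    print ord($c), "\n";        # 256 outside the pragma's scope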
> If there is a way to en/dis-able this mode, doesn't that have to be a > pragma? Doesn't it have to be lexically scoped? And if the answers to > these are yes, what do we do with things that are created under one mode > and then executed in the other?
For the strings themselves, I think it is reasonable to apply the semantics in which they are executed. Regexps are a harder call. Your analysis below is interesting. Show quoted text
> Juerd wrote: > ==== > Pragmas have problems, especially in regular expressions. And it's very > hard to load a pragma conditionally, which makes writing version > portable code hard. Besides that, any pragma affecting regex matches > needs to be carried in qr//, which in this case means new regex flags to > indicate the behavior for (?i:...). According to dmq, adding flags is > hard. > ==== > > I don't understand what you mean that pragmas have problems, esp in > re's. Please explain.
The compilation of a regexp may be optimized based on the semantics then in place, and may need to be able to preserve the semantics from the point of compilation to the point of use. It is certainly true that _if_ the regexp is optimized based on compilation semantics, then some definition should be made of what it means to compile it under one set of semantics and use it when different semantics are in effect. There are three choices: (1) error; (2) recompile to use the current semantics; (3) apply the semantics from the time of compilation. I think Juerd probably assumed (3), and thus assumes that the flags need to be preserved within the regexp. Show quoted text
> I had thought I had this solved for qr//i. The way I was planning to > implement this for pattern matching is quite simple. First, by changing > the existing fold table definitions to include the Unicode semantics, > the pattern matching magically starts working without any code logic > changes for all but two characters: the German sharp ss, and the micron > symbol. For these, I was planning to use the existing mechanisms to > compile the re as utf8, so it wouldn't require any new flags. Thus qr// > would be utf8 if it contained these two characters. And it works today > to match such a pattern against both non-utf8 and utf8 strings. I > haven't tested to see what happens when such a pattern is executed under > use bytes. I was presuming it did something reasonable. But now I'm > not so sure, as I've found a number of bugs in the re code in my > testing, and some are of a nature that I don't feel comfortable with my > level of knowledge about how it works to dive in and fix them. They > should be fixed anyway, and I'm hoping some expert will undertake that. > I think that once they're fixed, that I could extend them to work in > the latin1 range quite easily. So the bottom line is that qr//i may or > may not be a problem.
Clever, and maybe it works, or could be fixed to work. I can't say otherwise. Show quoted text
> For the other interactions, I'm not sure there is a problem. If one > creates a string whether or not this mechanism is on, it remains 8 bits, > unless it has a code point above 255. If one operates on it while this > mechanism is on, it gets unicode semantics, which in a few cases > irretrievably convert it to utf8 because the result is above 255. If > one operates on it while this mechanism is off, you get ASCII semantics. > I don't really see a problem with that. > > I think it would be easy to extend this to EBCDIC, at least the three > encodings perl has compiled-in tables for. The problem is that Rafael > said that there's no one testing on EBCDIC machines, so I couldn't know > if it worked or not before releasing it.
No comment. Show quoted text
> I'm also thinking that the Windows file name problems can be considered > independent of this, and addressed at a later time.
File names are currently well defined to be bytes, in the documentation. This is, of course, extremely wrong and limiting on Windows. There are no good solutions; there is a solution of using special APIs Jan Dubois has written (thanks Jan), but it likely can be considered independently, as much as it would be nice to solve it soon. Show quoted text
> I also agree with Glenn's and Juerd's wording changes. > > I saw nothing in my reading of the code that would lead me to touch the > utf8 flag's meaning. But I am finding weird bugs in which Perl > apparently gets mixed up about the flag. These vanish if I rearrange > the order of supposedly independent lines in the program. It looks like > it could be a wild write. I wrote a bug report [perl #59378], but I > think that the description of that is wrong.
Could well be some bugs in the edge cases here. I doubt the tests provide full coverage. Plan on writing more tests, if possible, at least where you find bugs either in test code or by reading code that you are learning or changing. Show quoted text
> So the bottom line for now, is I'd like to get some consensus about how > to turn it on and off (and whether to, which I think the answer is there > has to be a way to turn it off.) I guess I would claim that in 5.12, > "use bytes" could be used to turn it off. But that may be > controversial, and doesn't address backporting it.
For the moment, let's call "this feature" "enhanced Unicode semantics", and the Unicode semantics we have today "today's Unicode semantics".

"use bytes" can't turn off "enhanced Unicode semantics", because it implements its own semantics, which are different from "today's Unicode semantics". In addition to the two Unicode semantics, one could consider that there are two other sets of semantics... "today's bytes semantics", and "Glenn's proposed bytes semantics" (which eliminate character operations during use bytes sections). If my proposal is refined/further defined/accepted, it may also have to be turned off. Doing both with one flag is probably OK, as "use bytes" and "no bytes" differentiate sections that have bytes vs Unicode semantics.

So if it is a pragma, I think it has to be a different one than "use bytes", or an extension to "use bytes" (but the name "bytes" is wrong for the switch between the various Unicode semantics). If it is "simply" done one way in 5.10 and the other way in 5.12, there is no need for a pragma, and also no way to disable it short of switching versions of Perl.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
CC: perl5-porters [...] perl.org
Subject: Re: [perl #58182] Unicode problem
Date: Tue, 7 Oct 2008 09:05:31 -0500
From: "David Nicol" <davidnicol [...] gmail.com>
On Mon, Oct 6, 2008 at 11:04 PM, Glenn Linderman <perl@nevcal.com> wrote: Show quoted text
> \w is meaningless on binary data, for example, although character classes (could > be called byte classes) could still be useful without character semantics.
Let's say one is faced with a legacy delimited file that uses 0xFF for a separator. Running

    use bytes;
    @strings = $data =~ /(\w+)/g;

could be handy.
CC: perl5-porters [...] perl.org
Subject: Re: [perl #58182] Unicode problem
Date: Tue, 07 Oct 2008 09:20:00 -0700
To: David Nicol <davidnicol [...] gmail.com>
From: Glenn Linderman <perl [...] NevCal.com>
On approximately 10/7/2008 7:05 AM, came the following characters from the keyboard of David Nicol: Show quoted text
> On Mon, Oct 6, 2008 at 11:04 PM, Glenn Linderman <perl@nevcal.com> wrote:
>> \w is meaningless on binary data, for example, although character classes (could >> be called byte classes) could still be useful without character semantics.
> > Lets say one is faced with a legacy delimited file that uses 0xFF for > a separator. Running > > use bytes; > @strings = $data =~ /(\w+)/g; > > could be handy.
I guess your legacy delimited file is intended to be ASCII text, with each string delimited by 0xFF, but that is only a guess, since you didn't make it clear.

    @strings = split( /\xFF/, $data );

would do the same job, be independent of "use bytes;", and allow for punctuation and control characters in the @strings. You didn't state that the @strings should contain only alphanumerics, but your code assumes that. Of course, even if the @strings are supposed to contain only alphanumerics, your code would treat punctuation and control characters as additional delimiters, and would not only ignore the error case but make it impossible to detect without reexamining $data. My code would treat only \xFF as a delimiter (per your specification), and then additional code could be written to check the resulting @strings for validity as appropriate.

You'll need to contrive a more useful, and more completely specified, example to be convincing.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
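For concreteness, a runnable side-by-side of the two approaches; the sample data is made up:

    # Hypothetical sample: ASCII fields separated by 0xFF bytes.
    my $data = join "\xFF", 'alpha', 'beta,gamma', 'delta';

    # David's approach: only runs of word characters survive, so the comma
    # inside the second field silently acts as another separator.
    my @strings;
    {
        use bytes;
        @strings = $data =~ /(\w+)/g;   # ('alpha', 'beta', 'gamma', 'delta')
    }

    # Glenn's approach: split only on the actual delimiter byte, keeping
    # punctuation inside fields and leaving malformed fields detectable.
    my @fields = split /\xFF/, $data;   # ('alpha', 'beta,gamma', 'delta')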
Subject: Re: [perl #58182] Unicode problem
Date: Tue, 14 Oct 2008 21:26:24 -0600
To: perl5-porters [...] perl.org
From: karl williamson <public [...] khwilliamson.com>
From the little feedback I got on this issue and my own thoughts, I've developed a straw proposal for comment.

I propose a global flag that says whether or not the mode previously outlined (to give full Unicode semantics to characters in the full latin1 range even when not stored as utf8) is in effect or not. This flag will be turned on or off through a lexically scoped pragma. The default for 5.12 will be on. If this gets put into 5.10.x, the default will be off.

This mode will be subservient to "use bytes". That is, whenever the bytes mode is in effect, this new mode will not be. This is in part to preserve compatibility with existing programs that explicitly use the bytes pragma.

If a string is defined under one mode but looked at under the other, the mode in effect at the time of interpretation will be the one used.

A pattern, however, is compiled, and that compilation will remain in effect even if the mode changes.

One could argue about whether the last two paragraphs are the best or not, but doing them otherwise is a lot harder, and it is my belief that it would be the very rare program that would want to toggle between these modes, so that in practice it doesn't matter.

Comments?
CC: perl5-porters [...] perl.org
Subject: Re: [perl #58182] Unicode problem
Date: Wed, 15 Oct 2008 07:13:06 +0200
To: "karl williamson" <public [...] khwilliamson.com>
From: "Rafael Garcia-Suarez" <rgarciasuarez [...] gmail.com>
2008/10/15 karl williamson <public@khwilliamson.com>: Show quoted text
> From the little feedback I got on this issue and my own thoughts, I've > developed a straw proposal for comment. > > I propose a global flag that says whether or not the mode previously > outlined (to give full Unicode semantics to characters in the full latin1 > range even when not stored as utf8) is in effect or not. This flag will be > turned on or off through a lexically scoped pragma. The default for 5.12 > will be on. If this gets put into 5.10.x, the mode will be off. > > This mode will be subservient to "use bytes". That is, whenever the bytes > mode is in effect, this new mode will not be. This is in part to preserve > compatibility with existing programs that explicitly use the bytes pragma. > > If a string is defined under one mode but looked at under the other, the > mode in effect at the time of interpretation will be the one used. > > A pattern, however, is compiled, and that compilation will remain in effect > even if the mode changes. > > One could argue about whether the last two paragraphs are the best or not, > but doing them otherwise is a lot harder, and it is my that it would be the > very rare program that would want to toggle between these modes, so that in > practice it doesn't matter.
I think they're sensible.

We also need to specify what will happen when we combine two qr// patterns into a larger one, where one was compiled under one mode and the other under another. The simplest thing would be to have one of the modes (the full Unicode one) take precedence, I think.
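To illustrate the combination Rafael raises (the enabling pragma is still hypothetical at this point, so the comments describe what the proposal would do rather than what a current perl does):

    # Suppose $inner was compiled in a scope where the new Unicode mode was
    # on, and the outer pattern is compiled in a scope where it is off.
    my $inner    = qr/\x{e0}/i;        # imagine: compiled with Unicode folding
    my $combined = qr/^${inner}foo/;   # outer compilation under the other mode

    my $matched = "\x{c0}foo" =~ $combined;
    # Under the suggested precedence rule (the full Unicode mode wins),
    # $inner's folding would apply, so this would match.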
CC: perl5-porters [...] perl.org
Subject: Re: [perl #58182] Unicode problem
Date: Wed, 15 Oct 2008 18:59:45 -0600
To: Rafael Garcia-Suarez <rgarciasuarez [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
I agree with that, and we also have to say that this is subservient as well to any locale in effect, again for backwards compatibility when this mode becomes the default. Rafael Garcia-Suarez wrote: Show quoted text
> 2008/10/15 karl williamson <public@khwilliamson.com>:
>> From the little feedback I got on this issue and my own thoughts, I've >> developed a straw proposal for comment. >> >> I propose a global flag that says whether or not the mode previously >> outlined (to give full Unicode semantics to characters in the full latin1 >> range even when not stored as utf8) is in effect or not. This flag will be >> turned on or off through a lexically scoped pragma. The default for 5.12 >> will be on. If this gets put into 5.10.x, the mode will be off. >> >> This mode will be subservient to "use bytes". That is, whenever the bytes >> mode is in effect, this new mode will not be. This is in part to preserve >> compatibility with existing programs that explicitly use the bytes pragma. >> >> If a string is defined under one mode but looked at under the other, the >> mode in effect at the time of interpretation will be the one used. >> >> A pattern, however, is compiled, and that compilation will remain in effect >> even if the mode changes. >> >> One could argue about whether the last two paragraphs are the best or not, >> but doing them otherwise is a lot harder, and it is my that it would be the >> very rare program that would want to toggle between these modes, so that in >> practice it doesn't matter.
> > I think they're sensible. > > We need also to specify what will happen when we combines to qr// > patterns in a larger one, where one was compiled under a mode, the > other one under another. The simplest thing would be to have one of > the modes (the full Unicode one) take precedence, I think. > >
CC: perl5-porters [...] perl.org
Subject: C coding questions for [perl #58182] Unicode problem
Date: Fri, 17 Oct 2008 11:53:03 -0600
To: Rafael Garcia-Suarez <rgarciasuarez [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
I'm ready to start hardening my preliminary experiments into production code for this problem. I'm thinking about doing it in at least three separable patches, each dealing with a different portion of it.

And I have some questions about coding. I am finding as a newcomer that there is a tremendous amount to learn, and I don't want to waste my time (and yours) by going down dead-ends. I don't understand a lot of the nuances of the macros and functions, and I'm afraid some aren't documented. For example, I don't know what it means when a const is added to a macro or function name; perhaps that the result is promised to not be modified by the caller?

I have tried to conform to the current style with two exceptions. 1) I write a lot more comments than I see in the code. I have cut down on these a lot, but it's still more than you are used to. These can easily be removed. 2) Is there a reason that many compound statements begin on one line and are continued on another without braces, like

    if (a)
        b;

? I learned long ago not to do that, as it's too easy when modifying code to forget the braces are missing and to insert a statement between them, causing, for example, b to not be dependent on the if. This may show up in testing, or it may end up as a bug. Unless there's some reason like machine parsing of the code, I would like to be free to use braces in my code under these circumstances.

What about time-space tradeoffs? For time efficiency, it would be good to implement some of the operations with table lookups. I could use 3 or 4 tables, each with 256 entries of U8. Is this ok, or would you rather I have extra tests in code instead? And if I do use the tables, is the U8 typedef guaranteed to be unsigned and 8 bits, so that I can index into these tables safely? I see some existing code that appears to assume this, but it may not be a good example to follow, or I may be missing some place where it got tested first. I can always mask to make sure it's 8 bits, but if some compilers/architectures don't really allow unsigned, then that complicates things and makes the tables bigger.

The uc() function now tries to convert in place if possible. I would like to not bother with that, but to always allocate a new scalar to return the result in. The alternative is that under some conditions, the code would get partly through the conversion and discover that it needed more space than was available, and have to abandon the conversion and start over (or else do an extra pass first to figure this out, which I'm sure no one would advocate). Is it ok for me to make this change?

If I have to grow a string, is it better to grow just enough to get by, or should I add some extra to minimize the possibility of having to grow again? I don't know how the memory pool is handled. I presume that eventually some malloc gets done, and it's probably not for just 1 or 2 bytes. The code in the areas I've looked at currently asks for precisely the amount it needs at the moment for that string, and there is a comment about maybe having to do it a million times, but that's life. It would seem to me that if you need 3 bytes, you should grow by 6, which isn't a big waste if no more are needed, and would halve the number of growths required in the worst case. But what is the accepted method?

I have to convert a string to utf8. There is a convenient function to do so, bytes_to_utf8(), but the documentation warns it is experimental. Is it ok to use this function, and if not, what should I use?
And when I'm through with my first batch of changes, what should I do? I'd like to post it for code reading before submitting it as a patch. I've gotten quite a ways into the changes needed to pp.c, for example, and I have specific questions about why some things are done the way they are, etc., which I would put in comments in that place in the code. For example, I suspect that lcfirst and ucfirst have a bug that was fixed for lc and uc in an earlier patch, but the writer didn't think to apply it to the other functions; however, I don't know enough to be sure.

My experimental changes for uc, lc, ucfirst, lcfirst, \U, \u, \L, and \l cause one existing test case group to fail. This is in uni/t/overload.t. It is testing that toggling the utf8 flag causes the case changing functions to work or not work depending on the flag's state. My changes cause the case functions to work no matter what that bit says, so these tests fail. Is there some other point to these tests that I should be aware of, so I can revise them appropriately?

Thanks

Rafael Garcia-Suarez wrote: Show quoted text
> 2008/10/15 karl williamson <public@khwilliamson.com>:
>> From the little feedback I got on this issue and my own thoughts, I've >> developed a straw proposal for comment. >> >> I propose a global flag that says whether or not the mode previously >> outlined (to give full Unicode semantics to characters in the full latin1 >> range even when not stored as utf8) is in effect or not. This flag will be >> turned on or off through a lexically scoped pragma. The default for 5.12 >> will be on. If this gets put into 5.10.x, the mode will be off. >> >> This mode will be subservient to "use bytes". That is, whenever the bytes >> mode is in effect, this new mode will not be. This is in part to preserve >> compatibility with existing programs that explicitly use the bytes pragma. >> >> If a string is defined under one mode but looked at under the other, the >> mode in effect at the time of interpretation will be the one used. >> >> A pattern, however, is compiled, and that compilation will remain in effect >> even if the mode changes. >> >> One could argue about whether the last two paragraphs are the best or not, >> but doing them otherwise is a lot harder, and it is my that it would be the >> very rare program that would want to toggle between these modes, so that in >> practice it doesn't matter.
> > I think they're sensible. > > We need also to specify what will happen when we combines to qr// > patterns in a larger one, where one was compiled under a mode, the > other one under another. The simplest thing would be to have one of > the modes (the full Unicode one) take precedence, I think. > >
CC: perl5-porters [...] perl.org
Subject: Re: C coding questions for [perl #58182] Unicode problem
Date: Sat, 18 Oct 2008 09:35:01 +0200
To: "karl williamson" <public [...] khwilliamson.com>
From: "Rafael Garcia-Suarez" <rgarciasuarez [...] gmail.com>
2008/10/17 karl williamson <public@khwilliamson.com>: Show quoted text
> I'm ready to start hardening my preliminary experiments into production code > for this problem. I'm thinking about doing it in at least three separable > patches, each dealing with a different portion of it. > > And I have some questions about coding. I am finding as a newcomer that > there is a tremendous amount to learn and I don't want to waste my time (and > yours) by going down dead-ends. A don't understand a lot of the nuances of > the macros and functions, and I'm afraid some aren't documented. For > example, I don't know what it means when a const is added to a macro or > function name, perhaps that the result is promised to not be modified by the > caller?
The "const" name in a macro name is (IIRC) always related to the use of the "const" type modifier to qualify its return value. That is, you can't assign it to a non-const variable. Show quoted text
> I have tried to conform to the current style with two exceptions. 1) I > write a lot more comments than I see in the code. I have cut down on these > a lot, but it's still more than you are used to. These can easily be
I have absolutely no problem with comments ! especially in code that hairy. I should write more, too. Show quoted text
> removed. 2) Is there a reason that many compound statements begin on one > line and are continued on another without braces, like > if (a) > b; > ? I learned long ago not to do that, as it's too easy when modifying code > to forget the braces are missing and to insert a statement between them that > causes for example b to not be dependent on the if. This may show up in > testing, or it may be end up a bug. Unless there's some reason like machine > parsing of the code, I would like to be free to use braces in my code under > these circumstances.
Please do. Another advantage of using braces is that you can add statements in the "then" clause without modifying the "if" line, if you write your ifs like this:

    if (condition) {
        ...
    }

Fewer formatting changes, better history. Show quoted text
> What about time-space tradeoffs? For time efficiency, it would be good to > implement some of the operations with table look ups. I could use 3 or 4 > tables, each with 256 entries of U8. It this ok, or would you rather I have > extra tests in code instead?
Perl is usually more optimized for speed than for memory. Show quoted text
> And if I do use the tables, is the U8 typedef guaranteed to be unsigned and > 8 bits, so that I can index into these tables safely? I see some existing > code that appears to assume this, but it may not be a good example to > follow, or I may be missing some place where it got tested first. I can > always mask to make sure its 8 bits, but if some compilers/architectures > don't really allow unsigned, then that complicates things and makes the > tables bigger.
I think it's guaranteed to be unsigned, but not 8 bits. The U8SIZE symbol gives the size of an U8 in bytes. I'm not aware of any platform where it's not 1 byte, though. Any C portability expert on this? Show quoted text
> The uc() function now tries to convert in place if possible. I would like > to not bother with that, but to always allocate a new scalar to return the > result in. The alternative is that under some conditions, the code would > get partly through the conversion and discover that it needed more space > than was available, and have to abandon the conversion and start over (or > else do an extra pass first to figure this out, which I'm sure no one would > advocate). Is it ok for me to make this change?
I think it is. The old PV will be collected. And we can reoptimize it later. Show quoted text
> If I have to grow a string, is it better to grow just enough to get by, or > should I add some extra to minimize the possibility of having to grow again? > I don't know how the memory pool is handled. I presume that eventually > some malloc gets done, and its probably not for just 1 or 2 bytes. The code > in the areas I've looked at currently asks for precisely the amount it needs > at the moment for that string, and there is a comment about maybe having to > do it a million times, but that's life. It would seem to me that if you > need 3 bytes, you should grow by 6, which isn't a big waste if none more are > needed, and would halve the number of growths required in the worst case. > But what is the accepted method?
I'm not much familiar with perl's internal memory pools. And that kind of behaviour is difficult to choose without benchmarks. I would say, at first, grow by just what is needed. Show quoted text
> I have to convert a string to utf8. There is a convenient function to do > so, bytes_to_utf8(), but the documentation warns it is experimental. Is it > ok to use this function, and if not, what should I use?
It's been experimental for many years now. I think we could remove the "experimental" label. Show quoted text
> And when I'm through with my first batch of changes, what should I do? I'd > like to post it for code reading before submitting it as a patch. I've > gotten quite a ways into the changes needed to pp.c, for example, and I have > specific questions about why some things are done the way they are, etc, > which I would put in comments in that place in the code. For example, I > suspect that lcfirst and ucfirst have a bug that was fixed for lc and uc in > an earlier patch but the writer didn't think to appply it to the other > functions, but I don't know enough to be sure.
Please ask here. Are you familiar with git already, by the way? Show quoted text
> My experimental changes for uc, lc, ucfirst, lcfirst, \U, \u, \L, and \l > cause one existing test case group to fail. This is in uni/t/overload.t. > It is testing that toggling the utf8 flag causes the case changing > functions to work or not work depending on the flag's state. My changes > cause the case functions to work no matter what that bit says, so these > tests fail. Is there some other point to these tests that I should be aware > of, so I can revise them appropriately?
You can revise them.
CC: karl williamson <public [...] khwilliamson.com>, perl5-porters [...] perl.org
Subject: Re: C coding questions for [perl #58182] Unicode problem
Date: Sun, 19 Oct 2008 11:20:17 +0100
To: Rafael Garcia-Suarez <rgarciasuarez [...] gmail.com>
From: Nicholas Clark <nick [...] ccl4.org>
On Sat, Oct 18, 2008 at 09:35:01AM +0200, Rafael Garcia-Suarez wrote: Show quoted text
> 2008/10/17 karl williamson <public@khwilliamson.com>:
Show quoted text
> > I have tried to conform to the current style with two exceptions. 1) I
"style" singular? :-) Show quoted text
> > write a lot more comments than I see in the code. I have cut down on these > > a lot, but it's still more than you are used to. These can easily be
> > I have absolutely no problem with comments ! especially in code that hairy. > > I should write more, too.
More comments good. Please don't cut down, if writing more comes naturally to you. Show quoted text
> > And if I do use the tables, is the U8 typedef guaranteed to be unsigned and > > 8 bits, so that I can index into these tables safely? I see some existing > > code that appears to assume this, but it may not be a good example to > > follow, or I may be missing some place where it got tested first. I can > > always mask to make sure its 8 bits, but if some compilers/architectures > > don't really allow unsigned, then that complicates things and makes the > > tables bigger.
> > I think it's guaranteed to be unsigned, but not 8 bits. The U8SIZE > symbol gives the size of an U8 in bytes. I'm not aware of any platform > where it's not 1 byte, though. Any C portability expert on this?
It's always going to be an unsigned char, it's always going to be the smallest type on the platform, and it's always going to be at least 8 bits. I'm not sure if anyone has access to anything esoteric with 32 (or 9?) bit chars, on which they could try compiling perl. Things can go wrong with your code's assumptions if it's more than 8 bits? Show quoted text
> > My experimental changes for uc, lc, ucfirst, lcfirst, \U, \u, \L, and \l > > cause one existing test case group to fail. This is in uni/t/overload.t. > > It is testing that toggling the utf8 flag causes the case changing > > functions to work or not work depending on the flag's state. My changes > > cause the case functions to work no matter what that bit says, so these > > tests fail. Is there some other point to these tests that I should be aware > > of, so I can revise them appropriately?
> > You can revise them.
They were probably written by me, to ensure that the current behaviour is consistent with or without overloading; in particular, that overloading couldn't break things if the subroutine implementing it was inconsistent (or malicious) in what it returned. What had been happening was that the UTF-8 flag was getting set on the (outer) scalar, and then if the implementation on a subsequent call returned something that was not UTF-8 (and marked as not UTF-8), then the byte sequence was propagated, but not the change to the UTF-8 flag, resulting in a corrupt scalar.

Annotated history: http://public.activestate.com/cgi-bin/perlbrowse/b/t/uni/overload.t

So yes, revise them to behave correctly as per the new world order.

Nicholas Clark
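A made-up sketch of the kind of inconsistent overload described here; the class name and details are invented for illustration:

    # A stringification overload that does not return the same utf8-ness on
    # every call: first a utf8-flagged string, then a plain byte string.
    package Flipper;
    use overload '""' => sub {
        my $self = shift;
        return ( $self->{flip} ^= 1 ) ? "\x{100}" : "\xe0";
    };
    sub new { bless { flip => 0 }, shift }

    package main;
    my $obj = Flipper->new;
    # uni/t/overload.t exercises the case-changing functions on objects like
    # this, to make sure perl never ends up with a scalar whose bytes and
    # UTF-8 flag disagree.
    my @lowered = ( lc($obj), lc($obj) );   # successive calls see different utf8-ness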
Subject: Re: C coding questions for [perl #58182] Unicode problem
Date: Sun, 19 Oct 2008 20:59:57 -0600
To: Rafael Garcia-Suarez <rgarciasuarez [...] gmail.com>, karl williamson <public [...] khwilliamson.com>, perl5-porters [...] perl.org
From: karl williamson <public [...] khwilliamson.com>
Nicholas Clark wrote: Show quoted text
> ... >
>>> And if I do use the tables, is the U8 typedef guaranteed to be unsigned and >>> 8 bits, so that I can index into these tables safely? I see some existing >>> code that appears to assume this, but it may not be a good example to >>> follow, or I may be missing some place where it got tested first. I can >>> always mask to make sure its 8 bits, but if some compilers/architectures >>> don't really allow unsigned, then that complicates things and makes the >>> tables bigger.
>> I think it's guaranteed to be unsigned, but not 8 bits. The U8SIZE >> symbol gives the size of an U8 in bytes. I'm not aware of any platform >> where it's not 1 byte, though. Any C portability expert on this?
> > It's always going to be an unsigned char, it's always going to be the > smallest type on the platform, and it's always going to be at least 8 bits. > > I'm not sure if anyone has access to anything esoteric with 32 (or 9?) bit > chars, on which they could try compiling perl. Things can go wrong with your > code's assumptions if it's more than 8 bits? >
I only wanted to know if I have to think about exceeding array bounds. If it's exactly 8 bits and unsigned, there's no way for it to reference outside a 256-element array.
CC: Rafael Garcia-Suarez <rgarciasuarez [...] gmail.com>, perl5-porters [...] perl.org
Subject: Re: C coding questions for [perl #58182] Unicode problem
Date: Mon, 20 Oct 2008 19:27:43 +0100
To: karl williamson <public [...] khwilliamson.com>
From: Nicholas Clark <nick [...] ccl4.org>
On Sun, Oct 19, 2008 at 08:59:57PM -0600, karl williamson wrote: Show quoted text
> I only wanted to know if I have to be think about exceeding array > bounds. If it's exactly 8 bits and unsigned, there's no way for it to > reference outside a 256 element array.
I think that the perl source code is already making that assumption in places. But I doubt that there's a size or speed penalty in masking it with 0xFF, as any sane optimiser would spot that it's a no-op and eliminate the code. I tried:

$ cat index.c
#include <stdlib.h>
#include <stdio.h>

#ifndef MASK
#  define MASK
#endif

static unsigned char buffer[256];

int main (int argc, char **argv) {
    unsigned int count = sizeof(buffer);

    while (count--)
        buffer[count MASK] = ~count;

    while (*++argv) {
        const unsigned char i = (unsigned char) atoi(*argv);
        printf("%s: %d\n", *argv, buffer[i MASK]);
    }
    return 0;
}
$ gcc -Wall -c -O index.c
$ ls -l index.o
-rw-r--r-- 1 nick nick 1752 Oct 20 19:26 index.o
$ gcc -Wall -o index -O -DMASK='& 255' index.c
$ ls -l index.o
-rw-r--r-- 1 nick nick 1752 Oct 20 19:26 index.o

Nicholas Clark
CC: karl williamson <public [...] khwilliamson.com>, Rafael Garcia-Suarez <rgarciasuarez [...] gmail.com>, perl5-porters [...] perl.org
Subject: Re: C coding questions for [perl #58182] Unicode problem
Date: Tue, 21 Oct 2008 01:07:13 +0200
To: Nicholas Clark <nick [...] ccl4.org>
From: Marcus Holland-Moritz <mhx-perl [...] gmx.net>
On 2008-10-20, at 19:27:43 +0100, Nicholas Clark wrote:

> $ gcc -Wall -c -O index.c
> $ ls -l index.o
> -rw-r--r-- 1 nick nick 1752 Oct 20 19:26 index.o
> $ gcc -Wall -o index -O -DMASK='& 255' index.c
> $ ls -l index.o
> -rw-r--r-- 1 nick nick 1752 Oct 20 19:26 index.o

Despite the fact that you're most probably right, I don't
think your second run of gcc actually updated index.o... ;)

Marcus

--
The only difference between a car salesman and a computer salesman is
that the car salesman knows he's lying.
CC: karl williamson <public [...] khwilliamson.com>, Rafael Garcia-Suarez <rgarciasuarez [...] gmail.com>, perl5-porters [...] perl.org
Subject: Re: C coding questions for [perl #58182] Unicode problem
Date: Tue, 21 Oct 2008 04:09:11 +0100
To: Marcus Holland-Moritz <mhx-perl [...] gmx.net>
From: Nicholas Clark <nick [...] ccl4.org>
On Tue, Oct 21, 2008 at 01:07:13AM +0200, Marcus Holland-Moritz wrote:
> On 2008-10-20, at 19:27:43 +0100, Nicholas Clark wrote:
> > $ gcc -Wall -c -O index.c
> > $ ls -l index.o
> > -rw-r--r-- 1 nick nick 1752 Oct 20 19:26 index.o
> > $ gcc -Wall -o index -O -DMASK='& 255' index.c
> > $ ls -l index.o
> > -rw-r--r-- 1 nick nick 1752 Oct 20 19:26 index.o
>
> Despite the fact that you're most probably right, I don't
> think your second run of gcc actually updated index.o... ;)

Well spotted. This one did :-)

$ gcc -Wall -c -O index.c
$ ls -l index.o
-rw-r--r-- 1 nick nick 1752 Oct 21 04:06 index.o
$ gcc -Wall -c -O -DMASK='& 255' index.c
$ ls -l index.o
-rw-r--r-- 1 nick nick 1752 Oct 21 04:07 index.o

Show your data as well as your conclusions. It lets other people verify
your conclusions. (And, sometimes, more importantly, draw different
conclusions if your conclusions are wrong, but your observations are valid)

Nicholas Clark
Subject: Re: C coding questions for [perl #58182] Unicode problem
Date: Tue, 21 Oct 2008 08:20:26 +0200
To: perl5-porters [...] perl.org
From: Bo Lindbergh <blgl [...] hagernas.com>
In article <20081021030911.GL49335@plum.flirble.org>,
 nick@ccl4.org (Nicholas Clark) wrote:

> On Tue, Oct 21, 2008 at 01:07:13AM +0200, Marcus Holland-Moritz wrote:
> > On 2008-10-20, at 19:27:43 +0100, Nicholas Clark wrote:
> > > $ gcc -Wall -c -O index.c
> > > $ ls -l index.o
> > > -rw-r--r-- 1 nick nick 1752 Oct 20 19:26 index.o
> > > $ gcc -Wall -o index -O -DMASK='& 255' index.c
> > > $ ls -l index.o
> > > -rw-r--r-- 1 nick nick 1752 Oct 20 19:26 index.o
> >
> > Despite the fact that you're most probably right, I don't
> > think your second run of gcc actually updated index.o... ;)
>
> Well spotted. This one did :-)
>
> $ gcc -Wall -c -O index.c
> $ ls -l index.o
> -rw-r--r-- 1 nick nick 1752 Oct 21 04:06 index.o
> $ gcc -Wall -c -O -DMASK='& 255' index.c
> $ ls -l index.o
> -rw-r--r-- 1 nick nick 1752 Oct 21 04:07 index.o
>
> Show your data as well as your conclusions. It lets other people verify
> your conclusions. (And, sometimes, more importantly, draw different
> conclusions if your conclusions are wrong, but your observations are valid)

A small change in instruction count can be hidden by alignment padding.
Try comparing the actual assembly output instead.

$ gcc -Wall -O -o index1.s -S index.c
$ gcc -DMASK='& 255' -Wall -O -o index2.s -S index.c
$ diff index1.s index2.s
21c21
< la r2,lo16(_buffer-"L00000000001$pb")(r2)
---
> la r11,lo16(_buffer-"L00000000001$pb")(r2)
24a25
> rlwinm r2,r9,0,24,31
26c27
< stbx r0,r2,r9
---
> stbx r0,r11,r2

This is an Apple-built gcc 4.0.1 for 32-bit PowerPC, and a rotate-and-mask
instruction _is_ generated.

/Bo Lindbergh
Subject: What to call the pragma for [perl #58182] Unicode problem
Date: Thu, 23 Oct 2008 13:10:30 -0600
To: Perl5 Porters <perl5-porters [...] perl.org>
From: karl williamson <public [...] khwilliamson.com>
I'm implementing some changes to fix this, which Rafael has indicated could
be a new mode of operation, default off in 5.10, and on in 5.12, and
indicated so by use of a lexically-scoped pragma.

The problem is the name to use in the pragma. In my experimental version,
I'm using

    use latin1;
    no latin1;

But maybe someone has a better idea. I thought someone had suggested

    use unicode;

but I can't find the email that said that, and I don't think

    no unicode;

gives the right idea, as we are still using unicode semantics for
characters outside the range of 128-255.

    use unicode_semantics;

is too long, and again

    no unicode_semantics;

overstates what is turned off. What is really meant is "use unicode
semantics for latin1 non-ascii characters" and "no unicode semantics for
latin1 non-ascii characters".

Another way of looking at it might be

    no C_locale;
    use C_locale;

but I don't like that as well, for several reasons, one of which is that
people may not know what the C locale is.

So, any better ideas?
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: What to call the pragma for [perl #58182] Unicode problem
Date: Thu, 23 Oct 2008 13:08:08 -0700
To: karl williamson <public [...] khwilliamson.com>
From: Glenn Linderman <perl [...] NevCal.com>
On approximately 10/23/2008 12:10 PM, came the following characters from
the keyboard of karl williamson:

> we are still using unicode semantics for
> characters outside the range of 128-255.

> So, any better ideas?

Better? Well, at least more ideas... These all to enable your fixes; swap
use/no to disable.

    use fix_c128_255;   # :)
    use pure_uni;
    use uni_pure;
    use unipure;
    use codemode;
    use all_uni;
    use clean_uni;
    use real_uni;
    no uni_compat;
    no broken_unicode;
    no broken_uni;
    no buggy_unicode;
    no buggy_uni;

I extremely dislike

    use latin1;

because implicit conversions would happen even with

    no latin1;

(same problem you have with no unicode;).

    no C_locale;

has the problem that locale implies things like number, money, and date and
time formatting, and character classes and collation, in addition to
(perhaps) a default character encoding, and there is already a broken?
locale module, I believe, which would be confusing, and C_locale wouldn't
mean any of these things.

Of my ideas above, I sort of prefer the last 4... but maybe someone else
will suggest the best name.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
CC: "karl williamson" <public [...] khwilliamson.com>, "Perl5 Porters" <perl5-porters [...] perl.org>
Subject: Re: What to call the pragma for [perl #58182] Unicode problem
Date: Thu, 23 Oct 2008 15:20:56 -0500
To: "Glenn Linderman" <perl [...] nevcal.com>
From: "David Nicol" <davidnicol [...] gmail.com>
you could go pure-historical and set a naming precedent for such things with use fix58182;
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: What to call the pragma for [perl #58182] Unicode problem
Date: Thu, 23 Oct 2008 13:29:32 -0700
To: David Nicol <davidnicol [...] gmail.com>
From: "Kevin J. Woolley" <kevinw [...] activestate.com>
David Nicol wrote:

> you could go pure-historical and set a naming precedent for such things with
>
>     use fix58182;

You forgot to use a smilie. Don't scare me like that! ;)

Seems to me that something that mentions the version of Unicode that the
behaviour conforms to might be worth a shot. If I remember the details of
the underlying problem correctly, then:

    use unicode_31;

may be a good way to go. This would at least give someone unfamiliar with
the whole issue a place to start looking.

Cheers,
kjw
Subject: Re: What to call the pragma for [perl #58182] Unicode problem
Date: Fri, 24 Oct 2008 00:03:14 +0200
To: perl5-porters [...] perl.org
From: Moritz Lenz <moritz [...] casella.verplant.org>
karl williamson wrote: Show quoted text
> I'm implementing some changes to fix this, which Rafael has indicated > could be a new mode of operation, default off in 5.10, and on in 5.12, > and indicated so by use of a lexically-scoped pragma. > > The problem is the name to use in the pragma. In my experimental > version, I'm using > > use latin1; > no latin1;
I don't think that quite cuts it. Show quoted text
> But, maybe someone has a better idea.
I propose

    use unisane;

"uni" is already used in some places as an abbreviation of "Unicode" (like
in the names of the perluniintro and perlunitut man pages), and IMHO the
old behaviour is quite insane. So if you say "no unisane" you'll clearly be
stating that you want insane behaviour, and you'll get it.
> I thought someone had suggested > use unicode; > but I can't find the email that said that, and I don't think > no unicode; > gives the right idea, as we are still using unicode semantics for > characters outside the range of 128-255.
Aye. Show quoted text
> use unicode_semantics; > is too long, and again > no unicode_semantics > overstates what is turned off. > > What is really meant is > "use unicode semantics for latin1 non-ascii characters" and > "no unicode semantics for latin1 non-ascii characters" > > Another way of looking at it might be > no C_locale; > use C_locale; > but I don't like that as well for several reasons, one of which people > may not know what the C locale is.
Locales and Unicode are somewhat orthogonal concepts, and there's already too much confusion about their interaction without you adding even more to it ;-) Cheers, Moritz
Subject: Re: What to call the pragma for [perl #58182] Unicode problem
Date: Fri, 24 Oct 2008 09:19:25 +0200
To: perl5-porters [...] perl.org
From: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>
On Thu, 23 Oct 2008 13:29:32 -0700, "Kevin J. Woolley" <kevinw@activestate.com> wrote: Show quoted text
> David Nicol wrote: >
> > you could go pure-historical and set a naming precedent for such things with > > > > use fix58182;
> > You forgot to use a smilie. Don't scare me like that! ;) > > Seems to me that something that mentions the version of Unicode that the > behaviour conforms to might be worth a shot. If I remember the details > of the underlying problem correctly, then: > > use unicode_31; > > May be a good way to go. This would at least give someone unfamiliar > with the whole issue a place to start looking.
I agree with Kevin. uni isn't descriptive enough.

    use unicode_<whatever>;

seems a much more sane approach.

--
H.Merijn Brand          Amsterdam Perl Mongers  http://amsterdam.pm.org/
using & porting perl 5.6.2, 5.8.x, 5.10.x, 5.11.x on HP-UX 10.20, 11.00,
11.11, 11.23, and 11.31, SuSE 10.1, 10.2, and 10.3, AIX 5.2, and Cygwin.
http://mirrors.develooper.com/hpux/          http://www.test-smoke.org/
http://qa.perl.org    http://www.goldmark.org/jeff/stupid-disclaimers/
Subject: Re: What to call the pragma for [perl #58182] Unicode problem
Date: Fri, 24 Oct 2008 10:19:37 +0200
To: perl5-porters [...] perl.org
From: "Dr.Ruud" <rvtol+news [...] isolution.nl>
"H.Merijn Brand" schreef: Show quoted text
> use unicode_<whatever>;
use dwim "higher-ASCII"; ;-) -- Affijn, Ruud "Gewoon is een tijger."
Subject: Re: What to call the pragma for [perl #58182] Unicode problem
Date: Fri, 24 Oct 2008 19:48:16 +0200
To: perl5-porters [...] perl.org
From: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
* karl williamson <public@khwilliamson.com> [2008-10-23 21:15]:

> So, any better ideas?

`use unicode8bit`/`no unicode8bit`?

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
Subject: Re: What to call the pragma for [perl #58182] Unicode problem
Date: Sun, 26 Oct 2008 15:25:26 -0500
To: perl5-porters [...] perl.org
From: "David Nicol" <davidnicol [...] gmail.com>
> `use unicode8bit`/`no unicode8bit`?

use encode80toFF / no encode80toFF
Subject: Re: What to call the pragma for [perl #58182] Unicode problem
Date: Sun, 26 Oct 2008 23:40:50 +0100
To: perl5-porters [...] perl.org
From: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
* David Nicol <davidnicol@gmail.com> [2008-10-26 21:30]:
> > `use unicode8bit`/`no unicode8bit`?
>
> use encode80toFF / no encode80toFF

Uppercase bad. Make it `encode128to255` if you must. But the name you
proposed does little to explicate that the issue in question is Unicode
semantics. “Encode” points in the general direction by reference to
precedent in Perl terminology, but the precedent is vague and so is the
pointer to it.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
Subject: [perl #58182] Unicode bug: More questions about coding
Date: Tue, 18 Nov 2008 18:55:05 -0700
To: Perl5 Porters <perl5-porters [...] perl.org>
From: karl williamson <public [...] khwilliamson.com>
I'm almost ready to submit my proposed changes for the uc(), lcfirst(),
etc. functions for code review. But I have several more questions. These
functions are all in pp.c.

Currently, if in a "use bytes" scope these functions treat the data as
strict ASCII, and change the case accordingly. Someone earlier suggested
that this is a bug, that this mode is really for binary data only, and that
the case should not change in this mode. What should I do?

There are a couple of cases where a string has to be converted to utf8.
bytes_to_utf8() assumes the worst case, that the new string will occupy
2n+1 bytes, and allocates a new scalar with that size. The code in these
functions checks every time through the character-processing loop to see if
more space is needed, and if so grows the scalar by just that amount. (This
happens only in Unicode, where the worst case may be more than 2n.) Which
precedent would it be preferable for me to follow when the worst case is
2n?

The ucfirst() and lcfirst() functions are implemented in one function which
branches at the crucial moment to do the upper or lower case and then comes
back together. Comments in the code ask if the same thing should happen for
lc() and uc(). There are now several differences between the two, but the
vast majority of these routines is identical. Should I do the combining or
leave it alone?

Finally, it would be trivial to change ucfirst() and lcfirst() so that if
handed a utf8 string in which the first character (the only one being
operated on) is in the strict ASCII range, they look up its case change in
a compiled-in table instead of going out to the filesystem to look it up,
as they must do for the general case. The extra expense when this isn't
true is an extra comparison, but if it is true, there is quite a bit of
savings. Shall I make this change? An extension could be to even do this on
characters in the 128-255 range, but there would need to be more extensive
code changes, and extra tests, so I don't think that this is worth doing.
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #58182] Unicode bug: More questions about coding
Date: Tue, 18 Nov 2008 22:14:22 -0800
To: karl williamson <public [...] khwilliamson.com>
From: Glenn Linderman <perl [...] NevCal.com>
On approximately 11/18/2008 5:55 PM, came the following characters from the keyboard of karl williamson: Show quoted text
> I'm almost ready to submit my proposed changes for the uc(), lcfirst(), > etc. functions for code review. But I have several more questions.
I have no answers, only comments to be sure that you have considered things, and some opinions. Show quoted text
> These functions are all in pp.c. > > Currently, if in a "use bytes" scope these functions treat the data as > strict ASCII, and change the case accordingly. Someone earlier > suggested that this is a bug, that this mode is really for binary data > only, and that the case should not change in this mode. What should I do?
I think the two options are:

1) Make these functions noops in "use bytes" mode, bringing them in line
   with the "use bytes" documentation.

2) Case convert only ASCII letters [A-Za-z] (no code change) and change the
   "use bytes" documentation to admit that these functions do affect ASCII
   characters even in "use bytes" mode.

The second case is more compatible with today's actual (vs. documented)
behavior. Noops can also be obtained in other ways, so allowing them to
continue to operate on the ASCII characters gives more options.
> There are a couple cases where a string has to be converted to utf8. > bytes_to_utf8() assumes the worst case that the new string will occupy > 2n+1 bytes, and allocates a new scalar with that size. The code in > these functions check every time through the processing characters loop > to see if more space is needed, and if so grows the scalar by just that > amount. (This happens only in Unicode where the worst case may be more > than 2n) Which precedent would it be preferable for me to follow when > the worst case is 2n?
Hmm. So starting with bytes, the worst case is still 2n, no? But that
assumes that (1) you need to convert from bytes to UTF-8 and (2) that the
number of characters in the 128-255 range is significant. I'm not sure what
you already know at the decision point... I would guess that you know only
your position in the string, when you first realize that it must be
lengthened. Since even in Latin-1, most of the case shifting stays within
Latin-1, it seems unlikely that many characters will grow. But if you are
forced to convert to UTF-8, then... if you kept track of the number of
characters seen so far with the high bit set, then at the decision point
you would know to allocate 1 byte for each ASCII character seen so far, 2
bytes for each byte seen so far with the high bit set, and 2 bytes for each
character not yet processed. That could result in a memory savings vs.
using twice the total length, yet would avoid repetitive reallocations.

But starting with Unicode, I don't know if there is a rule, but are the
uppercase and lowercase characters far enough apart that they change the
size of their representation? If there are such, they would seem to be few,
because most of the uppercase and lowercase characters of each type are
grouped together in the same Unicode block. So the current algorithm sounds
appropriate for this.
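A minimal sketch of the size estimate described above, assuming a Latin-1
input of n bytes in which high_seen of the first done bytes have the high
bit set; the names are illustrative only, not taken from the perl source:

#include <stddef.h>

/* Bytes of UTF-8 needed when, at offset "done" into an n-byte Latin-1
 * string, we discover the result must be upgraded to UTF-8.  ASCII bytes
 * take 1 byte each, bytes with the high bit set take 2, and every byte not
 * yet examined is pessimistically assumed to take 2.  The +1 leaves room
 * for a trailing NUL, matching bytes_to_utf8()'s 2n+1 worst case. */
static size_t
utf8_size_estimate(size_t n, size_t done, size_t high_seen)
{
    return (done - high_seen)   /* ASCII bytes already seen     */
         + 2 * high_seen        /* high-bit bytes already seen  */
         + 2 * (n - done)       /* unseen bytes: assume worst   */
         + 1;                   /* trailing NUL                 */
}

This is never larger than the 2n+1 worst case, and smaller whenever any
ASCII bytes have already been seen; case changes that lengthen a character
(the 3:1 worst case discussed later in the thread) would still need the
per-character growth check.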
> The ucfirst() and lcfirst() functions are implemented in one function > which branches at the crucial moment to do the upper or lower case and > then comes back together. Comments in the code ask if the same thing > should happen for lc() and uc(). There are now several differences > between the two, but the vast majority of these routines is identical. > Should I do the combining or let it alone?
You are coding it... you decide. It can be smoked on blead, and merged back to 5.10.x only once it is reasonably certain to be correct. Show quoted text
> Finally, it would be trivial to change ucfirst() and lcfirst() so that > if handed a utf8 string in which the first character (the only one being > operated on) is in the strict ascii range, then to look up its case > change in a compiled-in table instead of going out to the filesystem to > look it up, as it must do for the general case. The extra expense when > this isn't true is an extra comparison, but if it is true, there is > quite a bit of savings. Shall I make this change? An extension could > be to even do this on characters in the 128-255 range, but there would > need to be more extensive code changes, and extra tests, so I don't > think that this is worth doing.
Sounds like a fair win, overall. There are a significant number of words in
Latin-based languages that start with unaccented (ASCII) first letters.

Regarding the 128-255 range, it would be possible (I think) to make a case
shifting table indexed by ord, that contained (1) the case shifted
character or (2) 0x0 or 0xff or 0x94 to flag that the table doesn't work
for this character. Or the table entries could be two bytes wide, so a
truly out-of-range value such as 0xffff could be used. (The one-byte flag
values are thought to be unlikely to occur in true character strings that
get passed to these functions, so a performance hit when encountered
wouldn't be onerous.)

Regarding the 0-128 range, it is possible to do the OR to lowercase and AND
to uppercase, if you check the range to be letters. Not clear that such is
faster, only smaller. The table would be more general, and faster, and work
for some of the non-ASCII.

The table for ucfirst & lcfirst could also be used for uc and lc, no? You
don't mention if that would be a win in those cases, or maybe they already
have a table? Granted it only helps people that operate in the ASCII (or
maybe Latin-1) range, but there are a lot of them.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
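For concreteness, a sketch of the Latin-1 upper-case mapping such a 128-255
table would have to encode, written here as a function with made-up names;
the sentinel return value plays the role of the flag entry described above:

/* Simple Unicode upper-case mapping for a Latin-1 code point.  Returns 0
 * (the sentinel) for U+00DF SHARP S, whose upper case is the two-character
 * string "SS"; the caller must handle that one specially.  Two other
 * characters uppercase outside Latin-1, U+00B5 MICRO SIGN and U+00FF SMALL
 * Y WITH DIAERESIS, which is why the return type is wider than a byte. */
static unsigned int
latin1_to_upper(unsigned char c)
{
    if (c >= 'a' && c <= 'z')
        return c - 0x20;              /* ASCII letters                     */
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7)
        return c - 0x20;              /* a-grave .. thorn, skipping 0xF7 ÷ */
    if (c == 0xB5)
        return 0x39C;                 /* micro sign -> Greek capital mu    */
    if (c == 0xFF)
        return 0x178;                 /* y-diaeresis -> U+0178             */
    if (c == 0xDF)
        return 0;                     /* sharp s -> "SS": no single char   */
    return c;                         /* caseless or already upper case    */
}

A table version would simply precompute these 256 values, storing the
sentinel at 0xDF (and, if only one byte per entry is affordable, at 0xB5
and 0xFF as well).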
CC: Juerd Waalboer <juerd [...] convolution.nl>
Subject: [perl #58182] Unicode bug: code review request
Date: Sat, 22 Nov 2008 15:24:22 -0700
To: Perl5 Porters <perl5-porters [...] perl.org>, Rafael Garcia-Suarez <rgarciasuarez [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
Attached for code review are changes to pp.c to enable case handling of
characters in the range 128-255. This is not a patch yet. Since I'm new to
changing the perl source, I wanted to get feedback before submitting a real
patch. Also, changes in a couple of other files depend on some unresolved
issues.

There are two attachments; one is in standard patch format. The other is an
html file containing another type of diff that I prefer. In it, the cyan
background is for deleted things, the yellow for added; changes in line
indentation are not shown.

I'm wondering what sorts of things I might be overlooking. I do know that I
don't know much about how overloading or magic might affect these routines.
I don't think my changes would affect these, but then, I don't know much
about these.

There are a couple of comments marked TODO, which means I have special
questions about them.

One of my design goals was to not slow things down unnecessarily. I think
that, if anything, I have sped things up. One area of concern I have,
though, in that regard is in the loop in pp_uc. I don't know anything about
modern optimizers. Before, it was a very tight loop, which an optimizer
could reasonably handle. Now, the mainstream is a tight loop, but there is
a conditional in it which can cause a significant amount of code to be
executed, which could fool the optimizer. Perhaps the non-mainstream case
should be put in a function.

I added some code to improve the efficiency of utf8 handling, so that if
perl has hard-coded into it the upper and lower cases of a character, it
doesn't have to go out to the general Unicode functions.

Most of the changes (as opposed to additions) are in uc_first(). Most of
these come from rearranging the code so that in all cases the changed
character is known before processing the rest of the string. Previously,
this preliminary step was done only if the string was encoded in utf-8;
otherwise, it was done as it went along.

An earlier author contemplated combining lc and uc into one function. I
haven't done that yet. At the expense of two extra comparisons per function
call, I could save perhaps as much space in perl as I've used up by adding
the code to do the new functionality.

I have added several code-generating macros. Normally, I don't like these,
but I think it makes things more readable here.

I have added many more comments than are typical in the Perl source. I
earlier got feedback that this might be a good thing.

If I don't get any feedback, I'll end up submitting a patch. This does pass
all regression tests. With the new behavior enabled it fails one, which I
have gotten permission from the author to change.

If you were to try compiling this, you would get missing symbol errors from
macros in headers. Several of these should be obvious in meaning, but here
are definitions of ones that may not be:

IN_UNI_8_BIT is true if and only if characters in the range 128-255 are to
be treated as having upper and lower case as defined by Unicode. When
false, these routines should deliver identical results as they always have.

toLOWER_LATIN1(c) takes a character in the range 0-255 and returns its
lower case as defined by Unicode.

toUPPER_LATIN1_MOD(c) takes a character in the range 0-255 and returns its
upper case as defined by Unicode, except for 3 tricky characters, for which
it returns a modified value, as explained in the code's comments.

UTF8_TWO_BYTE_HI(c) returns the first utf8-encoded byte for a Unicode
character c whose utf8 is known to take exactly two bytes. Similarly for
UTF8_TWO_BYTE_LO.
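For reference, the standard two-byte UTF-8 layout those two macros compute
is shown below; the MY_ names are placeholders, and the macros in the patch
may differ in detail (for example on EBCDIC platforms):

#include <assert.h>
#include <stdio.h>

/* A code point in the range 0x80-0x7FF is encoded in UTF-8 as two bytes:
 * 110xxxxx carrying the high five bits, then 10xxxxxx carrying the low
 * six. */
#define MY_UTF8_TWO_BYTE_HI(c)  ((unsigned char)(0xC0 | ((c) >> 6)))
#define MY_UTF8_TWO_BYTE_LO(c)  ((unsigned char)(0x80 | ((c) & 0x3F)))

int main(void)
{
    unsigned int c = 0xC0;      /* LATIN CAPITAL LETTER A WITH GRAVE */
    assert(c >= 0x80 && c <= 0x7FF);
    printf("U+%04X => %02X %02X\n", c,
           MY_UTF8_TWO_BYTE_HI(c), MY_UTF8_TWO_BYTE_LO(c));   /* C3 80 */
    return 0;
}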
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #58182] Unicode bug: More questions about coding
Date: Sat, 22 Nov 2008 15:54:46 -0700
To: Glenn Linderman <perl [...] NevCal.com>
From: karl williamson <public [...] khwilliamson.com>
Glenn Linderman wrote: Show quoted text
> On approximately 11/18/2008 5:55 PM, came the following characters from > the keyboard of karl williamson:
>> I'm almost ready to submit my proposed changes for the uc(), >> lcfirst(), etc. functions for code review. But I have several more >> questions.
> > > I have no answers, only comments to be sure that you have considered > things, and some opinions. > >
>> These functions are all in pp.c. >> >> Currently, if in a "use bytes" scope these functions treat the data as >> strict ASCII, and change the case accordingly. Someone earlier >> suggested that this is a bug, that this mode is really for binary data >> only, and that the case should not change in this mode. What should I >> do?
> > > I think the two options are: > > 1) make these functions noops in "use bytes" mode, bringing them in line > with the "use bytes" documentation. > > 2) case convert only ASCII letters [A-Za-z] (no code change) and change > the "use bytes" documentation to admit that these functions do affect > ASCII characters even in "use bytes" mode. > > The second case is more compatible with today's actual (vs. documented) > behavior. Noops can also be obtained in other ways, so allowing them to > continue to operate on the ASCII charcaters gives more options. >
In thinking about this some more, it seems like if we wanted to make things noops in bytes mode, the place to do it would be the parser (or whatever it is that sets up the execution stack) so that functions that are noops aren't even called. Show quoted text
>
>> There are a couple cases where a string has to be converted to utf8. >> bytes_to_utf8() assumes the worst case that the new string will occupy >> 2n+1 bytes, and allocates a new scalar with that size. The code in >> these functions check every time through the processing characters >> loop to see if more space is needed, and if so grows the scalar by >> just that amount. (This happens only in Unicode where the worst case >> may be more than 2n) Which precedent would it be preferable for me to >> follow when the worst case is 2n?
> > > Hmm. So starting with bytes, the worst case is still 2n, no? But that > assumes that (1) you need to convert from bytes to UTF-8 and (2) that > the number of characters in the 128-255 range is significant. I'm not > sure what you already know at the decision point... I would guess that > you know only your position in the string, when you first realize that > it must be lengthened. Since even in Latin-1, most of the case shifting > stays within Latin-1, it seems unlikely that many characters will grow. > But if you are forced to convert to UTF-8, then ... if you kept track > of the number of characters seen so far with the high bit set, then at > the decision point you would know to allocate 1 byte for each ASCII > character seen so far, 2 bytes for each byte seen so far with the high > bit set, and 2 bytes for each character not yet processed. That could > result in a memory savings vs. using twice the total length, yet would > avoid repetitive reallocations. > > But starting with Unicode, I don't know if there is a rule, but are the > uppercase and lowercase characters far enough apart, that they change > the size of their representation? If there are such, it would seem to > be few, because most of the uppercase and lowercase characters of each > type are grouped together in the same Unicode block. So the current > algorithm sounds appropriate for this. >
The worst case in Unicode is 3:1. But I decided to choose the worst case, as that is what is done when a string is upgraded to utf8. Show quoted text
>
>> The ucfirst() and lcfirst() functions are implemented in one function >> which branches at the crucial moment to do the upper or lower case and >> then comes back together. Comments in the code ask if the same thing >> should happen for lc() and uc(). There are now several differences >> between the two, but the vast majority of these routines is identical. >> Should I do the combining or let it alone?
> > > You are coding it... you decide. It can be smoked on blead, and merged > back to 5.10.x only once it is reasonably certain to be correct. >
I still am feeling my way about the change culture here. I've worked on projects where only a major bug warranted a change--customers had to live with the ones that management didn't deem major enough. And I've worked on projects where the developer was God, and could do whatever they liked. I prefer ones where there is a discussion and consensus as to what should happen. Show quoted text
>
>> Finally, it would be trivial to change ucfirst() and lcfirst() so that >> if handed a utf8 string in which the first character (the only one >> being operated on) is in the strict ascii range, then to look up its >> case change in a compiled-in table instead of going out to the >> filesystem to look it up, as it must do for the general case. The >> extra expense when this isn't true is an extra comparison, but if it >> is true, there is quite a bit of savings. Shall I make this change? >> An extension could be to even do this on characters in the 128-255 >> range, but there would need to be more extensive code changes, and >> extra tests, so I don't think that this is worth doing.
> > > Sounds like a fair win, overall. There are a significant number of > words in Latin-based languages that start with unaccented (ASCII) first > letters. > > Regarding the 128-255 range, it would be possible (I think) to make a > case shifting table indexed by ord, that contained (1) the case shifted > character (2) 0x0 or 0xff or 0x94 to flag that the table doesn't work > for this character. Or the table entries could be two bytes wide, so a > truly out of range character could be used 0xffff. (The one byte > suggestions are thought to be unlikely to true character strings that > get passed to these functions, so a performance hit when encountered > wouldn't be onerous.)
That's similar to what I did. Show quoted text
> > Regarding the 0-128 range, it is possible to do the OR to lowercase and > AND to uppercase, if you check the range to be letters. Not clear that > such is faster, only smaller. The table would be more general, and > faster, and work for some of the non-ASCII. >
I don't understand what you're saying here, but I use the existing code to handle characters in this range, which comes down to testing if 'A' <= c <= 'Z' on ASCII machines and then adding 32 to get the lower case; similar for lower case going the other way. A table lookup is about the same amount of machine work. Show quoted text
> The table for ucfirst & lcfirst, could also be used for uc and lc, no? > You don't mention if that would be a win in those cases, or maybe they > already have a table? Granted it only helps people that operate in the > ASCII (or maybe Latin-1) range, but there are a lot of them. > >
I ended up doing it for all the functions, and including the range 128-255 without going to the general Unicode functions. The expense for characters above 255 is 2 comparisons. The payoff for those less is significant.
CC: Juerd Waalboer <juerd [...] convolution.nl>
Subject: Re: [perl #58182] Unicode bug: code review request
Date: Sat, 22 Nov 2008 16:14:20 -0700
To: Perl5 Porters <perl5-porters [...] perl.org>, Rafael Garcia-Suarez <rgarciasuarez [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
karl williamson wrote: Show quoted text
> Attached for code review are changes to pp.c to enable case handling of > characters in the range 128-255. This is not a patch yet. Since I'm > new to changing the perl source, I wanted to get feedback before > submitting a real patch. Also, changes in a couple of other files > depend on some unresolved issues. > > There are two attachments, one is in standard patch format. The other > is an html file containing another type of diff that I prefer. In it, > the cyan background is for deleted things, the yellow for added; changes > in line indentation are not shown. > > I'm wondering what sorts of things I might be overlooking. I do know > that I don't know much about how overloading or magic might affect these > routines. I don't think my changes would affect these, but then, I > don't know much about these. > > There are a couple of comments marked TODO, which means I have special > questions about them. > > One of my design goals was to not slow things down unnecessarily. I > think that, if anything, I have sped things up. One area of concern I > have, though, in that regard is in the loop in pp_uc. I don't know > anything about modern optimizers. Before, it was a very tight loop, > which an optimizer could reasonably handle. Now, the mainstream is a > tight loop, but there is a conditional in it which can cause a > significant amount of code to be executed, that could fool the > optimizer. Perhaps the non-mainstream case should be put in a function. > > I added some code to improve the efficiency of utf8 handling, so that if > perl has hard-coded into it the upper and lower cases of a character, it > doesn't have to go out to the general Unicode functions. > > The most changes (as opposed to additions) are in uc_first(). Most of > these come from rearranging the code so that in all cases the changed > character is known before processing the rest of the string. Previously, > only if it was encoded in utf-8 would this preliminary step be done. > Otherwise, it was done as it went along. > > An earlier author contemplated combining lc and uc into one function. I > haven't done that, yet. At the expense of two extra comparisons per > function call, I could save, perhaps as much space in perl as I've used > up by adding the code to do the new functionality. > > I have added several code-generating macros. Normally, I don't like > these, but I think it makes things more readable here. > > I have added many more comments than are typical in the Perl source. I > earlier got feedback that this might be a good thing. > > If I don't get any feedback, I'll end up submitting a patch. This does > pass all regression tests. With the new behavior enabled it fails one, > which I have gotten permission from the author to change. > > If you were to try compiling this, you would get missing symbol errors, > from macros in headers. Several of these should be obvious what they > mean, but here are definitions of ones that may not be: > > IN_UNI_8_BIT is true if and only if characters in the range 128-255 are > to be treated as having upper and lower case as defined by Unicode. When > false, these routines should deliver identical results as they always have. > > toLOWER_LATIN1(c) takes a character in the range 0-255 and returns its > lower case as defined by Unicode. > > toUPPER_LATIN1_MOD(c) takes a character in the range 0-255 and returns > its upper case as defined by Unicode, except for 3 tricky characters, > for which it returns a modified value, as explained in the code's comments. 
> > UTF8_TWO_BYTE_HI(c) returns the first utf8-encoded byte for a Unicode > character c whose utf8 is known to take exactly two bytes. Similarly > for UTF8_TWO_BYTE_LO >
The email that got through the filters stripped off my html file, I presume for security reasons.
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #58182] Unicode bug: More questions about coding
Date: Sat, 22 Nov 2008 16:27:03 -0800
To: karl williamson <public [...] khwilliamson.com>
From: Glenn Linderman <perl [...] NevCal.com>
On approximately 11/22/2008 2:54 PM, came the following characters from the keyboard of karl williamson: Show quoted text
> Glenn Linderman wrote:
>> On approximately 11/18/2008 5:55 PM, came the following characters >> from the keyboard of karl williamson:
>>> I'm almost ready to submit my proposed changes for the uc(), >>> lcfirst(), etc. functions for code review. But I have several more >>> questions.
>> >> >> I have no answers, only comments to be sure that you have considered >> things, and some opinions. >> >>
>>> These functions are all in pp.c. >>> >>> Currently, if in a "use bytes" scope these functions treat the data >>> as strict ASCII, and change the case accordingly. Someone earlier >>> suggested that this is a bug, that this mode is really for binary >>> data only, and that the case should not change in this mode. What >>> should I do?
>> >> >> I think the two options are: >> >> 1) make these functions noops in "use bytes" mode, bringing them in >> line with the "use bytes" documentation. >> >> 2) case convert only ASCII letters [A-Za-z] (no code change) and >> change the "use bytes" documentation to admit that these functions do >> affect ASCII characters even in "use bytes" mode. >> >> The second case is more compatible with today's actual (vs. >> documented) behavior. Noops can also be obtained in other ways, so >> allowing them to continue to operate on the ASCII charcaters gives >> more options. >>
> > In thinking about this some more, it seems like if we wanted to make > things noops in bytes mode, the place to do it would be the parser (or > whatever it is that sets up the execution stack) so that functions that > are noops aren't even called.
Maybe so. Would be a more efficient noop that way. But what I meant is that
if the user wants a noop, they wouldn't generally code it as

    use bytes;
    $foo = uc( $foo );
    no bytes;
>>> There are a couple cases where a string has to be converted to utf8. >>> bytes_to_utf8() assumes the worst case that the new string will >>> occupy 2n+1 bytes, and allocates a new scalar with that size. The >>> code in these functions check every time through the processing >>> characters loop to see if more space is needed, and if so grows the >>> scalar by just that amount. (This happens only in Unicode where the >>> worst case may be more than 2n) Which precedent would it be >>> preferable for me to follow when the worst case is 2n?
>> >> >> Hmm. So starting with bytes, the worst case is still 2n, no? But >> that assumes that (1) you need to convert from bytes to UTF-8 and (2) >> that the number of characters in the 128-255 range is significant. >> I'm not sure what you already know at the decision point... I would >> guess that you know only your position in the string, when you first >> realize that it must be lengthened. Since even in Latin-1, most of >> the case shifting stays within Latin-1, it seems unlikely that many >> characters will grow. But if you are forced to convert to UTF-8, then >> ... if you kept track of the number of characters seen so far with the >> high bit set, then at the decision point you would know to allocate 1 >> byte for each ASCII character seen so far, 2 bytes for each byte seen >> so far with the high bit set, and 2 bytes for each character not yet >> processed. That could result in a memory savings vs. using twice the >> total length, yet would avoid repetitive reallocations. >> >> But starting with Unicode, I don't know if there is a rule, but are >> the uppercase and lowercase characters far enough apart, that they >> change the size of their representation? If there are such, it would >> seem to be few, because most of the uppercase and lowercase characters >> of each type are grouped together in the same Unicode block. So the >> current algorithm sounds appropriate for this. >>
> The worst case in Unicode is 3:1. But I decided to choose the worst > case, as that is what is done when a string is upgraded to utf8.
I'm assuming you are saying here that the worst case for a lowercase
character converted to uppercase, or an uppercase character converted to
lowercase, can be 3:1 (since these are the operations of concern), rather
than that the worst-case conversion of one character byte to one UTF-8
sequence is 3:1 (since I don't think that happens).

It is a space vs. time tradeoff... and the results are highly dependent on
the data being manipulated... So you allocate 3:1 space; if you don't use
it, do you give it back, or leave it dangling for the next potential
operation?
>>> The ucfirst() and lcfirst() functions are implemented in one function >>> which branches at the crucial moment to do the upper or lower case >>> and then comes back together. Comments in the code ask if the same >>> thing should happen for lc() and uc(). There are now several >>> differences between the two, but the vast majority of these routines >>> is identical. Should I do the combining or let it alone?
>> >> >> You are coding it... you decide. It can be smoked on blead, and >> merged back to 5.10.x only once it is reasonably certain to be correct. >>
> > I still am feeling my way about the change culture here. I've worked on > projects where only a major bug warranted a change--customers had to > live with the ones that management didn't deem major enough. And I've > worked on projects where the developer was God, and could do whatever > they liked. I prefer ones where there is a discussion and consensus as > to what should happen.
My comment was similar to others I've seen here. I'm by no means an insider, although I've been hanging around quite a while. You are looking at the code and doing the work; as long as you have a reasonable justification (like the comment you found) for the change, I think it will fly. Gratuitous changes don't seem to be particularly welcome, but if it makes the code more correct, easier to maintain, shorter, not measurably slower, things seem to be accepted. Show quoted text
>>> Finally, it would be trivial to change ucfirst() and lcfirst() so >>> that if handed a utf8 string in which the first character (the only >>> one being operated on) is in the strict ascii range, then to look up >>> its case change in a compiled-in table instead of going out to the >>> filesystem to look it up, as it must do for the general case. The >>> extra expense when this isn't true is an extra comparison, but if it >>> is true, there is quite a bit of savings. Shall I make this change? >>> An extension could be to even do this on characters in the 128-255 >>> range, but there would need to be more extensive code changes, and >>> extra tests, so I don't think that this is worth doing.
>> >> >> Sounds like a fair win, overall. There are a significant number of >> words in Latin-based languages that start with unaccented (ASCII) >> first letters. >> >> Regarding the 128-255 range, it would be possible (I think) to make a >> case shifting table indexed by ord, that contained (1) the case >> shifted character (2) 0x0 or 0xff or 0x94 to flag that the table >> doesn't work for this character. Or the table entries could be two >> bytes wide, so a truly out of range character could be used 0xffff. >> (The one byte suggestions are thought to be unlikely to true character >> strings that get passed to these functions, so a performance hit when >> encountered wouldn't be onerous.)
> > That's similar to what I did.
>> >> Regarding the 0-128 range, it is possible to do the OR to lowercase >> and AND to uppercase, if you check the range to be letters. Not clear >> that such is faster, only smaller. The table would be more general, >> and faster, and work for some of the non-ASCII. >>
> > I don't understand what you're saying here, but I use the existing code > to handle characters in this range, which comes down to testing if 'A' > <= c <= 'Z' on ASCII machines and then adding 32 to get the lower case; > similar for lower case going the other way. A table lookup is about the > same amount of machine work.
Adding 32 for numbers in that range is the same as OR 32; subtracting 32
for numbers in that range is the same as AND ~32. Flipping the bit via
logical operations, versus doing arithmetic: six of one, half a dozen of
the other. Logic operations used to be faster, way back when, because there
was no possibility of carries; with today's processors, it is generally one
clock for either.

Now, though, you've got me not understanding something. If there was "ad
hoc" logic to test for the a-z and A-Z ranges, and it is about the same
expense as a table, but now you've invented a table for the 128-255 range,
wouldn't it be simpler overall to use the table for the a-z and A-Z ranges
also?
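A minimal sketch of the bit trick being referred to (the range check
matters, since the 0x20 flip is only valid for the 52 ASCII letters):

/* ASCII upper- and lower-case letters differ only in bit 0x20
 * ('A' is 0x41, 'a' is 0x61), so once the range check passes,
 * OR-ing in 0x20 lowercases and AND-ing with ~0x20 uppercases,
 * which is the same as adding or subtracting 32. */
static unsigned char ascii_to_lower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c | 0x20) : c;
}

static unsigned char ascii_to_upper(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ? (unsigned char)(c & ~0x20) : c;
}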
>> The table for ucfirst & lcfirst, could also be used for uc and lc, no? >> You don't mention if that would be a win in those cases, or maybe they >> already have a table? Granted it only helps people that operate in >> the ASCII (or maybe Latin-1) range, but there are a lot of them.
> > I ended up doing it for all the functions, and including the range > 128-255 without going to the general Unicode functions. The expense for > characters above 255 is 2 comparisons. The payoff for those less is > significant.
-- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #58182] Unicode bug: More questions about coding
Date: Sat, 22 Nov 2008 19:34:21 -0700
To: Glenn Linderman <perl [...] NevCal.com>
From: karl williamson <public [...] khwilliamson.com>
Glenn Linderman wrote: Show quoted text
> On approximately 11/22/2008 2:54 PM, came the following characters from > the keyboard of karl williamson:
>> Glenn Linderman wrote:
>>> On approximately 11/18/2008 5:55 PM, came the following characters >>> from the keyboard of karl williamson:
>>>> I'm almost ready to submit my proposed changes for the uc(), >>>> lcfirst(), etc. functions for code review. But I have several more >>>> questions.
>>> >>> >>> I have no answers, only comments to be sure that you have considered >>> things, and some opinions. >>> >>>
>>>> These functions are all in pp.c. >>>> >>>> Currently, if in a "use bytes" scope these functions treat the data >>>> as strict ASCII, and change the case accordingly. Someone earlier >>>> suggested that this is a bug, that this mode is really for binary >>>> data only, and that the case should not change in this mode. What >>>> should I do?
>>> >>> >>> I think the two options are: >>> >>> 1) make these functions noops in "use bytes" mode, bringing them in >>> line with the "use bytes" documentation. >>> >>> 2) case convert only ASCII letters [A-Za-z] (no code change) and >>> change the "use bytes" documentation to admit that these functions do >>> affect ASCII characters even in "use bytes" mode. >>> >>> The second case is more compatible with today's actual (vs. >>> documented) behavior. Noops can also be obtained in other ways, so >>> allowing them to continue to operate on the ASCII charcaters gives >>> more options. >>>
>> >> In thinking about this some more, it seems like if we wanted to make >> things noops in bytes mode, the place to do it would be the parser (or >> whatever it is that sets up the execution stack) so that functions >> that are noops aren't even called.
> > > Maybe so. Would be a more efficient noop that way. But what I meant, > is that if the user wants a noop, they wouldn't generally code it as > > use bytes; > $foo = uc( $foo ) > no bytes; > >
Sure. What I meant was that the function isn't the place to be coding in
noop cases, so I could have answered my own question if I had thought about
it. I should let it do what it always has done in bytes mode, and if
someone wants to change perl so that it acts as documented, the place to do
so is not the function, but the parser.
>>>> There are a couple cases where a string has to be converted to utf8. >>>> bytes_to_utf8() assumes the worst case that the new string will >>>> occupy 2n+1 bytes, and allocates a new scalar with that size. The >>>> code in these functions check every time through the processing >>>> characters loop to see if more space is needed, and if so grows the >>>> scalar by just that amount. (This happens only in Unicode where the >>>> worst case may be more than 2n) Which precedent would it be >>>> preferable for me to follow when the worst case is 2n?
>>> >>> >>> Hmm. So starting with bytes, the worst case is still 2n, no? But >>> that assumes that (1) you need to convert from bytes to UTF-8 and (2) >>> that the number of characters in the 128-255 range is significant. >>> I'm not sure what you already know at the decision point... I would >>> guess that you know only your position in the string, when you first >>> realize that it must be lengthened. Since even in Latin-1, most of >>> the case shifting stays within Latin-1, it seems unlikely that many >>> characters will grow. But if you are forced to convert to UTF-8, >>> then ... if you kept track of the number of characters seen so far >>> with the high bit set, then at the decision point you would know to >>> allocate 1 byte for each ASCII character seen so far, 2 bytes for >>> each byte seen so far with the high bit set, and 2 bytes for each >>> character not yet processed. That could result in a memory savings >>> vs. using twice the total length, yet would avoid repetitive >>> reallocations. >>> >>> But starting with Unicode, I don't know if there is a rule, but are >>> the uppercase and lowercase characters far enough apart, that they >>> change the size of their representation? If there are such, it would >>> seem to be few, because most of the uppercase and lowercase >>> characters of each type are grouped together in the same Unicode >>> block. So the current algorithm sounds appropriate for this. >>>
>> The worst case in Unicode is 3:1. But I decided to choose the worst >> case, as that is what is done when a string is upgraded to utf8.
> > > I'm assuming you are saying here that the worst case for a lowercase > character converted to uppercase, or an uppercase character converted to > lowercase, can be 3:1 (since these are the operations of concern), > rather than the worst case conversion of one character byte to one UTF-8 > sequence is 3:1 (since I don't think that happens). > > It is a space vs time tradeoff... and the results are highly dependent > on the data being manipulated... > > So you allocate 3:1 space, if you don't use it, do you give it back, or > leave it dangle for the next potential operation? >
The worst case for converting a single byte to utf8 is 2:1. The worst case,
in general, for changing the case of a utf8 character is 3:1, because of
the extra modifiers that go with it. The way the functions (not ones I have
touched, by the way) work is that they essentially malloc enough space for
the worst case of converting from bytes to utf8. Any extra dangles until
the scalar's reference count goes to 0, when the entire amount is freed.
This extra may be needed if the variable is appended to, say. If the scalar
has to grow beyond what is adjacent to the string (and I haven't really
looked at this code, but am doing some surmising), a new string is
allocated, the original's space is freed, and the scalar now has a
different string pointer.
>
>>>> The ucfirst() and lcfirst() functions are implemented in one >>>> function which branches at the crucial moment to do the upper or >>>> lower case and then comes back together. Comments in the code ask >>>> if the same thing should happen for lc() and uc(). There are now >>>> several differences between the two, but the vast majority of these >>>> routines is identical. Should I do the combining or let it alone?
>>> >>> >>> You are coding it... you decide. It can be smoked on blead, and >>> merged back to 5.10.x only once it is reasonably certain to be correct. >>>
>> >> I still am feeling my way about the change culture here. I've worked >> on projects where only a major bug warranted a change--customers had >> to live with the ones that management didn't deem major enough. And >> I've worked on projects where the developer was God, and could do >> whatever they liked. I prefer ones where there is a discussion and >> consensus as to what should happen.
> > > My comment was similar to others I've seen here. I'm by no means an > insider, although I've been hanging around quite a while. You are > looking at the code and doing the work; as long as you have a reasonable > justification (like the comment you found) for the change, I think it > will fly. Gratuitous changes don't seem to be particularly welcome, but > if it makes the code more correct, easier to maintain, shorter, not > measurably slower, things seem to be accepted. > >
>>>> Finally, it would be trivial to change ucfirst() and lcfirst() so >>>> that if handed a utf8 string in which the first character (the only >>>> one being operated on) is in the strict ascii range, then to look up >>>> its case change in a compiled-in table instead of going out to the >>>> filesystem to look it up, as it must do for the general case. The >>>> extra expense when this isn't true is an extra comparison, but if it >>>> is true, there is quite a bit of savings. Shall I make this >>>> change? An extension could be to even do this on characters in the >>>> 128-255 range, but there would need to be more extensive code >>>> changes, and extra tests, so I don't think that this is worth doing.
>>> >>> >>> Sounds like a fair win, overall. There are a significant number of >>> words in Latin-based languages that start with unaccented (ASCII) >>> first letters. >>> >>> Regarding the 128-255 range, it would be possible (I think) to make a >>> case shifting table indexed by ord, that contained (1) the case >>> shifted character (2) 0x0 or 0xff or 0x94 to flag that the table >>> doesn't work for this character. Or the table entries could be two >>> bytes wide, so a truly out of range character could be used 0xffff. >>> (The one byte suggestions are thought to be unlikely to true >>> character strings that get passed to these functions, so a >>> performance hit when encountered wouldn't be onerous.)
>> >> That's similar to what I did.
>>> >>> Regarding the 0-128 range, it is possible to do the OR to lowercase >>> and AND to uppercase, if you check the range to be letters. Not >>> clear that such is faster, only smaller. The table would be more >>> general, and faster, and work for some of the non-ASCII. >>>
>> >> I don't understand what you're saying here, but I use the existing >> code to handle characters in this range, which comes down to testing >> if 'A' <= c <= 'Z' on ASCII machines and then adding 32 to get the >> lower case; similar for lower case going the other way. A table >> lookup is about the same amount of machine work.
> > > add 32 for numbers in that range is the same as OR 32; sub 32 for > numbers in that range is the same as AND ~ 32. Flipping the bit via > logical operations, versus doing arithmetic. 6 of one, a half-dozen of > the other. Logic operations used to be faster, way back when, because > there were no possibility of carries; with today's processors, it is > generally one clock for either. > > Now, though, you've got me not understanding something. > > If there was "ad hoc" logic to test for a-z A-Z ranges, and it is about > the same expense as a table, but that now you've invented a table for > the 128-255 range, wouldn't it be simpler overall to use the table for > the a-z A-Z ranges also? > >
I'm sorry I wasn't very clear. Restated: in compatibility mode the existing macros are used, which treat the non-ASCII chars as caseless, but in the new mode the table is used for the entire 0-255 range.
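(An aside for readers following the add-32 / OR-32 exchange above: in ASCII the two cases of a letter differ only in the 0x20 bit, so the arithmetic and the logical forms are interchangeable. A tiny illustration, not code from the patch:

    # ASCII-only: adding 32 and OR-ing in 0x20 give the same lower-case letter.
    for my $c ('A' .. 'Z') {
        my $via_add = chr( ord($c) + 32 );     # arithmetic form
        my $via_or  = chr( ord($c) | 0x20 );   # logical form
        die "mismatch for $c" unless $via_add eq $via_or;
    }
    print "add-32 and OR-0x20 agree for A-Z\n";

Neither form is safe outside a-z/A-Z, which is why the range check comes first in either scheme.)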
>>> The table for ucfirst & lcfirst, could also be used for uc and lc, >>> no? You don't mention if that would be a win in those cases, or maybe >>> they already have a table? Granted it only helps people that operate >>> in the ASCII (or maybe Latin-1) range, but there are a lot of them.
>> >> I ended up doing it for all the functions, and including the range >> 128-255 without going to the general Unicode functions. The expense >> for characters above 255 is 2 comparisons. The payoff for those less >> is significant.
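(To make the sentinel idea above concrete: a handful of Latin-1 characters simply have no single-character, in-range uppercase, which is why any 0-255 table needs an escape hatch. A hypothetical fragment, not the table the patch actually uses; the helper and hash names are made up:

    # 0 means "no usable single-byte uppercase; fall through to the general
    # Unicode path".  Only a few illustrative entries are shown.
    my %upper = (
        0xE0 => 0xC0,   # a-grave     -> A-grave: stays within 0-255
        0xFE => 0xDE,   # thorn       -> capital THORN: also fine
        0xDF => 0,      # sharp s     -> "SS", two characters
        0xFF => 0,      # y-diaeresis -> U+0178, outside 0-255
        0xB5 => 0,      # micro sign  -> U+039C GREEK CAPITAL MU, outside 0-255
    );

    sub uc_latin1_byte {                  # hypothetical helper name
        my $ord = shift;
        my $up  = exists $upper{$ord} ? $upper{$ord} : $ord;
        return $up ? chr($up) : undef;    # undef: caller must take the slow path
    }

Whether the escape value is a one-byte sentinel or a wider table entry is a space/simplicity trade-off; the point is only that some escape value is required.)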
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: [perl #58182] Unicode bug: More questions about coding
Date: Sat, 22 Nov 2008 18:59:35 -0800
To: karl williamson <public [...] khwilliamson.com>
From: Glenn Linderman <perl [...] NevCal.com>
On approximately 11/22/2008 6:34 PM, came the following characters from the keyboard of karl williamson: Show quoted text
> Glenn Linderman wrote:
>> On approximately 11/22/2008 2:54 PM, came the following characters >> from the keyboard of karl williamson:
>>> Glenn Linderman wrote:
>>>> On approximately 11/18/2008 5:55 PM, came the following characters >>>> from the keyboard of karl williamson:
>> But what I meant, >> is that if the user wants a noop, they wouldn't generally code it as >> >> use bytes; >> $foo = uc( $foo ) >> no bytes; >> >>
> Sure. What I meant was that the function isn't the place to be coding > in noop cases, so I could have answered my own question if I had thought > about it. I should leave it do what it always has done in bytes mode, > and if someone wants to change perl so that it acts as documented, the > place to do so is not the function, but the parser.
Not sure that I agree that the parser should know that much about what the functions do, but on the other hand, having a function that does nothing in certain lexical situations seems useless, so we've arrived at the same conclusion by different paths... the functions should continue to behave as they did in the use bytes case.
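(For anyone following along at home, the reason the wrapper idiom above is even thinkable is that "use bytes" is lexically scoped, so it affects only operations compiled inside its block. A small demonstration of the scoping itself, unrelated to the patch:

    my $s = "snowman: \x{2603}";    # the wide character forces UTF-8 storage
    {
        use bytes;                  # lexical: applies only to code in this block
        print length($s), "\n";     # 12: counts the bytes of the encoding
    }
    print length($s), "\n";         # 10: counts characters again out here

Whether uc() should become a no-op under that pragma is the question Karl and Glenn settle above.)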
> The worst case for a single byte to utf8 is 2:1. The worst case for in > general changing the case of a utf8 character is 3:1, because of the > extra modifiers that go with it. The way the functions (not ones I have > touched, by the way) work is that they essentially malloc enough space > for the worst case for converting from byte to utf8. Any extra dangles > until the scalar's reference count goes to 0, when the entire amount is > freed. This extra may be needed if the variable is appended to, say. If > the scalar has to grow beyond what is adjacent to the string (and I > haven't really looked at this code, but am doing some surmising), a new > string is allocated, the original's space is freed, and the scalar now > has a different string pointer.
I would surmise the same way you did, for when the space is consumed. For the 3:1 case, I understand what you have done, thanks for clarifying. Whether one should be concerned about the possibility of consuming triple the space needed for a typical character string because of case shifting is not my call, certainly, but I'm glad to understand this can happen. For temporary variables, it is kind of a ho-hum situation, a bit of space wasted until they drop out of scope. For variables that might stick around for a while, it could be a concern.

Perhaps it is my ignorance, but I know of no way for the programmer to say "OK, this string is now going to be used as the key in a hash that will last the lifetime of the program, and it would be good to make it as short as possible". Such a function could be used for premature optimization, but could save significant space for long-term data that has, say, been uppercased for consistent comparison to input values (rather than doing less efficient case-insensitive comparisons).
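(One knob that does exist in core, though it is not an answer to the buffer question raised above, is utf8::downgrade: it re-encodes a string as single bytes when every code point fits in 0-255, so a long-lived key that only needed the wide encoding temporarily can at least shed the per-character overhead. A hedged sketch with made-up variable names:

    my $key = uc $title;          # $title is some long-lived input string
    utf8::downgrade($key, 1);     # one byte per character again, if possible;
                                  # the second argument turns failure into a
                                  # no-op instead of a fatal error
    $index{$key} = $record;       # %index and $record are likewise hypothetical

It changes only the internal representation, not the over-allocated slack from the worst-case malloc described above.)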
>> If there was "ad hoc" logic to test for a-z A-Z ranges, and it is >> about the same expense as a table, but that now you've invented a >> table for the 128-255 range, wouldn't it be simpler overall to use the >> table for the a-z A-Z ranges also? >> >>
> I'm sorry I wasn't very clear. Restated, in compatibility mode the > existing macros are used which have the non-ascii chars be caseless, but > in the new mode, the table is used for the entire 0-255 range.
Sounds good. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Subject: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Wed, 09 Dec 2009 12:11:00 -0700
To: Perl5 Porters <perl5-porters [...] perl.org>, Juerd Waalboer <juerd [...] convolution.nl>, yves.orton [...] booking.com, Glenn Linderman <perl [...] NevCal.com>
From: karl williamson <public [...] khwilliamson.com>
I believe this resolves other bug reports, but haven't had time to look them up.

The patch is both attached and available at:
git://github.com/khwilliamson/perl.git
branch: matching

This patch makes case-sensitive regex matching give the same results regardless of whether the string and/or pattern are in utf8, unless "use legacy 'unicode8bit'" is in effect, in which case it works as before.

Since Yves is incommunicado, I took what he had done before Larry's veto and extended and modified it, adding an intermediate way. What that means is that anything that looks like [[:xxx:]] will match only in the ASCII range, or in the current locale, if set. I never heard any controversy about that part of the proposal, and it makes sense to me that a Posix construct should act like the Posix definition says to.

\d, \s, and \w (hence \b) and their complements act as before, except that when 8-bit unicode mode is on, they also match appropriately in the 128-255 range.

This solves the utf8ness problem: the Posix classes never match outside their locale or ASCII, so utf8ness doesn't matter; and the others match the same whether utf8 or not.

I was surprised at how little code was actually involved. Making Posix always mean Posix simplified things quite a bit. \d doesn't match anything in the 128-255 range, so it did not have to be touched. Essentially, all that had to be done was to create new regnodes for \s, \w, and \b (and complements) that say to match using Unicode semantics. Everywhere their parallel nodes appear in the code, I added these nodes. When compiling, regcomp checks for being in 8-bit unicode semantics mode, and if so uses the new node; if not, it uses the old node. In execution, regexec uses the old definition when matching the old node, and the new semantics when the match is for the new node. I split [[:word:]] from \w and [[:digit:]] from \d so that they would match using Posix semantics regardless of utf8ness.

But that is basically it.

Several .t files depended on the legacy behaviors to test edge cases for utf8ness. I added a 'use legacy' to those.

Also, several text processing modules can't deal with \s matching a no-break space. I spent too much time trying to learn them, finding the one or two lines in each that were at fault, to decide whether this is a bug or not. It is a bug if the text can be utf8, which would automatically cause the \s to suddenly match the no-break space. But I wasn't sure which ones are claimed to transparently handle utf8. So I added a 'use legacy' to those modules, which gives the same behavior as in the past.

Several TODOs were accomplished and removed from some regex .t files.

I took the opportunity, while changing regcomp.c, to add a croak when the re has gone insane; I've had it in my development version for some time. That insanity seems to happen when there are too many /\N{...}/ calls in a program.
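To make the split concrete, here is a hedged sketch of what the behaviour described above means for one Latin-1 letter; the spelling of the pragma is the part still in flux, so only the matching side is shown, and which answers you get depends on the mode in effect:

    my $e_acute  = "\xE9";      # LATIN SMALL LETTER E WITH ACUTE, one byte
    my $upgraded = $e_acute;
    utf8::upgrade($upgraded);   # same character, UTF-8 internal representation

    # A Posix bracket class is ASCII/locale-only under the proposal, so both
    # representations are treated alike (and do not match):
    print +($_ =~ /[[:alpha:]]/ ? "match" : "no match"), "\n"
        for $e_acute, $upgraded;

    # \w under the 8-bit unicode mode treats 0xE9 as a word character, again
    # independently of how the string happens to be stored:
    print +($_ =~ /\w/ ? "match" : "no match"), "\n"
        for $e_acute, $upgraded;

The point of the patch is that neither pair of answers flips with the internal representation any more.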


CC: Perl5 Porters <perl5-porters [...] perl.org>, Juerd Waalboer <juerd [...] convolution.nl>, yves.orton [...] booking.com, Glenn Linderman <perl [...] nevcal.com>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Wed, 9 Dec 2009 21:00:27 +0100
To: karl williamson <public [...] khwilliamson.com>
From: demerphq <demerphq [...] gmail.com>
2009/12/9 karl williamson <public@khwilliamson.com>: Show quoted text
> I believe this resolves other bug reports, but haven't had time to look them > up. > > The patch is both attached, and available at: > git://github.com/khwilliamson/perl.git > branch: matching > > This patch makes case-sensitive regex matching give the same results > regardless of whether the string and/or pattern are in utf8, unless "use > legacy 'unicode8bit'" is in effect, in which case it works as before. > > Since Yves is incommunicado,
I was heads-uped about this mail, but I've not had any time to respond yet. Sorry. Show quoted text
> I took what he had done before Larry's veto and > extended and modified it, adding an intermediate way.  What that means is > that anything that looks like[[:xxx:]] will match only in the ASCII range, > or in the current locale, if set.  I never heard any controversy about that > part of the proposal, and it makes sense to me that a Posix construct should > act like the Posix definition says to.
This is good IMO, it will allow us to close a number of open tickets. Show quoted text
> \d, \s, and \w (hence \b) and their complements act as before, except that > when 8-bit unicode mode is on, they also match appropriately in the 128-255 > range. > > This solves the utf8ness problem, as the Posix never match outside their > locale or ascii, so utf8ness doesn't matter; and the others match the same > whether utf8 or not. > > I was surprised at actually how little code was involved.  Making Posix > always mean Posix simplified things quite a bit.  \d doesn't match anything > in the 128-255 range, so it did not have to be touched. Essentially, all > that had to be done was to create new regnodes for \s, \w, and \b (and > complements) that say to match using unicode semantics.  Everywhere their > parallel nodes are in the code, I added these nodes.  When compiling, > regcomp checks for being in 8-bit unicode semantics mode, and if so, uses > the new node; if not it uses the old node.  In execution, regexec uses the > old definition when matching the old node, and the new semantics when the > match is for the new node.  I split [[:word:]] from \w and [[:digit:]] from > \d so that they would match using Posix semantics regardless of utf8ness. > > But that is basically it. > > Several .t files depended on the legacy behaviors to test edge cases for > utf8ness.  I added a 'use legacy' to those. > > Also, several text processing modules can't deal with \s matching a no-break > space.  I spent too much time trying to learn them to decide if this is a > bug or not, finding the one or two lines in each that were at fault.  It is > a bug if the text can be utf8, which would automatically cause the \s to > suddenly match the no-break space.  But I wasn't sure which ones are claimed > to transparently handle utf8.  So, I added a 'use legacy' to the modules, > which gives the same behavior as in the past. > > Several TODOs were accomplished and removed from some regex .t files > > I took advantage of changing regcomp.c to add a croak when the re has gone > insane; I've had it in my development version for some time.  It seems to > happen when there are too many /\N{...}/ calls in a program. >
I had a quick review of the patch and what you have done.

I have two minor objections, but I don't think they need be seen as roadblocks.

First, the problem of qr// raises its head. You construct a pattern in one context with your new pragma in effect, and then embed it in another pattern somewhere else, and the magicness of the pattern is lost. This is the same problem as with use locale, and personally I think it breaks the general modern model of patterns. However, it is better than nothing, and modifiers can be leveraged on top of your patch, so that is fine IMO.

Second, and really this is just another facet of the original problem, is that people now need to modify existing code to preserve the existing semantics. If this were controlled by a modifier then that wouldn't be necessary, as we would just make the default modifier behave as in 5.8.x; also, if really necessary, we could bifurcate the POSIX stuff into multiple opcodes (old/new behaviour) and resolve any objections to fixing the POSIX opcodes.

However, my opinion is this is a really good step forward and should be applied to blead.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Wed, 9 Dec 2009 21:09:58 +0000
To: Perl5 Porters <perl5-porters [...] perl.org>
From: Zefram <zefram [...] fysh.org>
demerphq wrote: Show quoted text
>First, the problem of qr// raises its head. You construct a pattern >one context with your new pragma in effect, and then embed it in >another pattern somewhere else and the magicness of the pattern is >lost.
Or vice versa, of course. This means that for a pattern to be freely embeddable you need to avoid all the constructs whose behaviour depends on the pragma. This is no worse than the current situation, where you need to avoid those constructs because they're broken, but for that class of situation it's no better either. And how many people are actually going to think about the pragma when they write their exportable regexps?

The existence of a pragma with this class of effects is a problem. I also don't think we should be supporting buggy behaviour by any mechanism. The old behaviour isn't just legacy, it's a bug, and any code that invokes it is buggy. Please just fix the bug, don't bless it.

-zefram
CC: Perl5 Porters <perl5-porters [...] perl.org>, Juerd Waalboer <juerd [...] convolution.nl>, yves.orton [...] booking.com, Glenn Linderman <perl [...] NevCal.com>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Wed, 9 Dec 2009 17:21:54 -0500
To: karl williamson <public [...] khwilliamson.com>
From: jesse <jesse [...] fsck.com>
Karl,

Thank you very much for all your work on this. I'll admit that the patch is a bit more extensive than I'd anticipated.

At a basic procedural level, any module that's inside cpan/ really needs to be patched upstream in the relevant CPAN distribution and then pulled into blead as it's released to CPAN.

That highlights my base concern here: as of right now, if those changes were pushed into the CPAN distributions, they wouldn't run on any Perl release before 5.11.2. That issue is fixable if we put a version of "feature.pm" for older releases of Perl on CPAN, but it gets increasingly hard to keep "legacy" directives synced between the differing realities of different Perls' concepts of "legacy".

Talking to Nicholas about my concerns, he suggested that many of these problems would go away if legacy directives always defaulted to enabled. I know that a number of folks are eager to jettison historical designs that are now considered to have been mistakes, but intentionally breaking backwards compatibility by inverting default behavior isn't the right thing for us to be doing.

Would you be comfortable with flopping the 'unicode8bit' legacy default, such that users who want the new semantics would use something like "no legacy 'unicode8bit'", "use feature 'unicode8bit'", or "use feature ':5.12'"?

Thanks,
Jesse

On Wed, Dec 09, 2009 at 12:11:00PM -0700, karl williamson wrote:
> I believe this resolves other bug reports, but haven't had time to look > them up. > > The patch is both attached, and available at: > git://github.com/khwilliamson/perl.git > branch: matching > > This patch makes case-sensitive regex matching give the same results > regardless of whether the string and/or pattern are in utf8, unless "use > legacy 'unicode8bit'" is in effect, in which case it works as before. > > Since Yves is incommunicado, I took what he had done before Larry's veto > and extended and modified it, adding an intermediate way. What that > means is that anything that looks like[[:xxx:]] will match only in the > ASCII range, or in the current locale, if set. I never heard any > controversy about that part of the proposal, and it makes sense to me > that a Posix construct should act like the Posix definition says to. > > \d, \s, and \w (hence \b) and their complements act as before, except > that when 8-bit unicode mode is on, they also match appropriately in the > 128-255 range. > > This solves the utf8ness problem, as the Posix never match outside their > locale or ascii, so utf8ness doesn't matter; and the others match the > same whether utf8 or not. > > I was surprised at actually how little code was involved. Making Posix > always mean Posix simplified things quite a bit. \d doesn't match > anything in the 128-255 range, so it did not have to be touched. > Essentially, all that had to be done was to create new regnodes for \s, > \w, and \b (and complements) that say to match using unicode semantics. > Everywhere their parallel nodes are in the code, I added these nodes. > When compiling, regcomp checks for being in 8-bit unicode semantics > mode, and if so, uses the new node; if not it uses the old node. In > execution, regexec uses the old definition when matching the old node, > and the new semantics when the match is for the new node. I split > [[:word:]] from \w and [[:digit:]] from \d so that they would match > using Posix semantics regardless of utf8ness. > > But that is basically it. > > Several .t files depended on the legacy behaviors to test edge cases for > utf8ness. I added a 'use legacy' to those. > > Also, several text processing modules can't deal with \s matching a > no-break space. I spent too much time trying to learn them to decide if > this is a bug or not, finding the one or two lines in each that were at > fault. It is a bug if the text can be utf8, which would automatically > cause the \s to suddenly match the no-break space. But I wasn't sure > which ones are claimed to transparently handle utf8. So, I added a 'use > legacy' to the modules, which gives the same behavior as in the past. > > Several TODOs were accomplished and removed from some regex .t files > > I took advantage of changing regcomp.c to add a croak when the re has > gone insane; I've had it in my development version for some time. It > seems to happen when there are too many /\N{...}/ calls in a program.
> From 65f96077a5c64ea2ebaa200194782540c112fd8d Mon Sep 17 00:00:00 2001 > From: Karl Williamson <khw@khw-desktop.(none)> > Date: Wed, 9 Dec 2009 11:25:36 -0700 > Subject: [PATCH] regex case-sensitive match utf8ness independent > > --- > cpan/Pod-Simple/lib/Pod/Simple/BlackBox.pm | 4 +- > cpan/Test-Harness/lib/TAP/Parser/YAMLish/Reader.pm | 1 + > cpan/podlators/lib/Pod/Text.pm | 1 + > cpan/podlators/lib/Pod/Text/Color.pm | 1 + > cpan/podlators/lib/Pod/Text/Overstrike.pm | 1 + > cpan/podlators/lib/Pod/Text/Termcap.pm | 1 + > dist/Storable/t/downgrade.t | 6 +- > ext/POSIX/t/time.t | 1 + > handy.h | 13 + > lib/legacy.t | 127 +++++++++- > regcomp.c | 269 +++++++++++++++----- > regcomp.h | 53 +++-- > regcomp.sym | 19 +- > regexec.c | 176 ++++++++++---- > regnodes.h | 54 +++- > t/op/sysio.t | 6 +- > t/re/pat_special_cc.t | 1 + > t/re/re_tests | 2 +- > t/re/reg_posixcc.t | 32 ++-- > 19 files changed, 604 insertions(+), 164 deletions(-)
CC: karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>, Juerd Waalboer <juerd [...] convolution.nl>, yves.orton [...] booking.com, Glenn Linderman <perl [...] NevCal.com>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Wed, 9 Dec 2009 22:27:51 +0000
To: jesse <jesse [...] fsck.com>
From: Nicholas Clark <nick [...] ccl4.org>
On Wed, Dec 09, 2009 at 05:21:54PM -0500, jesse wrote: Show quoted text
> Talking to Nicholas about my concerns, he suggested that many of these > problems would go away if legacy directives always defaulted to enabled.
With a slightly more "exciting" view, on which I don't know Jesse's opinion: the intent is that specific legacy behaviours will change, by warning in the next major release, and then not being the default in the one after that.

Although, thinking more about that, it means that "legacy" would be slightly hairy, in that the default-enabled subset would be different on different (future) major releases. I don't know if that's too hard to explain and teach.

Nicholas Clark
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Wed, 9 Dec 2009 23:47:45 +0100
To: jesse <jesse [...] fsck.com>, karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>, Juerd Waalboer <juerd [...] convolution.nl>, yves.orton [...] booking.com, Glenn Linderman <perl [...] nevcal.com>
From: demerphq <demerphq [...] gmail.com>
2009/12/9 Nicholas Clark <nick@ccl4.org>: Show quoted text
> On Wed, Dec 09, 2009 at 05:21:54PM -0500, jesse wrote: >
>> Talking to Nicholas about my concerns, he suggested that many of these >> problems would go away if legacy directives always defaulted to enabled.
> > With a slightly more "exciting" view, that I don't know Jesse's opinion on, > that the intent is that specific legacy behaviours will change, by in the > next major release warning, and the one after that not being the default. > > Although thinking more about that, it means that "legacy" would be slightly > hairy, in that the default enabled subset would be different on different > (future) major releases. I don't know if that's too hard to explain and teach.
I am firmly convinced the only sane solution is modifiers. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Wed, 9 Dec 2009 17:56:26 -0500
To: demerphq <demerphq [...] gmail.com>
From: jesse <jesse [...] fsck.com>
On Wed, Dec 09, 2009 at 11:47:45PM +0100, demerphq wrote: Show quoted text
> 2009/12/9 Nicholas Clark <nick@ccl4.org>:
> > On Wed, Dec 09, 2009 at 05:21:54PM -0500, jesse wrote: > >
> >> Talking to Nicholas about my concerns, he suggested that many of these > >> problems would go away if legacy directives always defaulted to enabled.
> > > > With a slightly more "exciting" view, that I don't know Jesse's opinion on, > > that the intent is that specific legacy behaviours will change, by in the > > next major release warning, and the one after that not being the default. > > > > Although thinking more about that, it means that "legacy" would be slightly > > hairy, in that the default enabled subset would be different on different > > (future) major releases. I don't know if that's too hard to explain and teach.
> > I am firmly convinced the only sane solution is modifiers.
I very much liked your modifier plan when you described it to me. I _can_ see a desire to update Perl's default semantics for this or some other core feature. There are certainly historical decisions that I'd love to see rectified, as they are clearly bugs. What I worry about at night[1] is breaking bugward-compatibility. I don't want 20% of CPAN breaking needlessly on my watch. If we're going to "fix" default semantics, it is imperative that users declare a desire for the new behavior.

-Jesse

[1] Terrifyingly, the dream I remember having as I woke up this morning was about release engineering 5.12.0. Really.[2]

[2] Does this job come with health insurance? Does that insurance include coverage for psychological care or psychoactive medication?
CC: Perl5 Porters <perl5-porters [...] perl.org>, yves.orton [...] booking.com, Glenn Linderman <perl [...] NevCal.com>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Wed, 09 Dec 2009 20:55:38 -0700
To: Juerd Waalboer <juerd [...] convolution.nl>
From: karl williamson <public [...] khwilliamson.com>
Juerd Waalboer wrote: Show quoted text
> [snip] > > These "posix" constructs have for a long time been documented as > *equivalent* to \d, \s and \w, with two remarks: [[:space:]] also > includes \cK and [[:word:]] doesn't even exist in POSIX. > > Changing them is as bad as changing the metacharacters. Changing them to > break the equivalency might even be worse. > > Also, note that perlre calls this "POSIX character class **syntax**" > (emphasis mine). > > An even stronger argument is that perlre defines equivalence with > \p{...}, and explicitly mentions that these are Unicode constructs.
Just so everything is exposed, an argument the other way is that Punct is very different between Posix and Unicode, in the ASCII range.
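To spell that out: POSIX punct covers all 32 non-alphanumeric graphic ASCII characters, while Unicode's General_Category treats nine of them ($ + < = > ^ ` | ~) as symbols rather than punctuation. A small probe; what it prints depends on your perl and on which semantics are in effect, which is rather the point:

    for my $c (split //, q{$+<=>^`|~!,.}) {
        printf "%-3s [[:punct:]]=%s  \\p{General_Category=Punctuation}=%s\n",
            $c,
            ($c =~ /[[:punct:]]/                      ? 'y' : 'n'),
            ($c =~ /\p{General_Category=Punctuation}/ ? 'y' : 'n');
    }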
CC: demerphq <demerphq [...] gmail.com>, Perl5 Porters <perl5-porters [...] perl.org>, Juerd Waalboer <juerd [...] convolution.nl>, Nicholas Clark <nick [...] ccl4.org>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Wed, 09 Dec 2009 21:16:49 -0700
To: jesse <jesse [...] fsck.com>
From: karl williamson <public [...] khwilliamson.com>
jesse wrote: Show quoted text
> > > On Wed, Dec 09, 2009 at 11:47:45PM +0100, demerphq wrote:
>> 2009/12/9 Nicholas Clark <nick@ccl4.org>:
>>> On Wed, Dec 09, 2009 at 05:21:54PM -0500, jesse wrote: >>>
>>>> Talking to Nicholas about my concerns, he suggested that many of these >>>> problems would go away if legacy directives always defaulted to enabled.
>>> With a slightly more "exciting" view, that I don't know Jesse's opinion on, >>> that the intent is that specific legacy behaviours will change, by in the >>> next major release warning, and the one after that not being the default. >>> >>> Although thinking more about that, it means that "legacy" would be slightly >>> hairy, in that the default enabled subset would be different on different >>> (future) major releases. I don't know if that's too hard to explain and teach.
>> I am firmly convinced the only sane solution is modifiers.
> > I very much liked your modifier plan when you described it to me. I _can_ > see a desire to update Perl's default semantics for this or some other > core feature. There are certainly historical decisions that I'd love to > see rectified, as they are clearly bugs. What I worry about at night[1] > is breaking bugward-compatibility. I don't want 20% of CPAN breaking > needlessly on my watch. If we're going to "fix" default semantics, it is > imperative that users need to declare a desire for the new behavior. > > -Jesse > > > [1] Terrifyingly, the dream I remember having as I woke up this morning > was about release engineering 5.12.0. Really.[2] > > [2] Does this job come with health insurance? Does that insurance > include coverage for psychological care or psychoactive medication? >
Well, clearly there is some controversy now that I had not anticipated.

I can empathize, Jesse. I have been a project manager for a few projects. Even when I was working with a bunch of programmers who were getting paid a lot of money to do the "right thing", and I knew them and their work pretty well, it was nerve-wracking, especially as the release date crept, nay galloped, up. I can only imagine how much worse it is when you don't really know these volunteer participants.

I've come out of retirement to work on this glaring hole in Perl with respect to Unicode. It was bigger than I imagined. And I would like to see it fixed, yesterday. Also, anecdotally, I ran into a friend, actually an ex-coworker of mine, who is a linguist, and when I told him what I was doing, he got excited. He'd tried, and given up, working with Perl on Unicode before. So now he wants me to tell him as soon as this is available.

That said, it is more important not to destabilize existing code. I don't know what the right thing is for 5.12.
CC: jesse <jesse [...] fsck.com>, demerphq <demerphq [...] gmail.com>, Perl5 Porters <perl5-porters [...] perl.org>, Juerd Waalboer <juerd [...] convolution.nl>, Nicholas Clark <nick [...] ccl4.org>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Wed, 9 Dec 2009 23:32:32 -0500
To: karl williamson <public [...] khwilliamson.com>
From: jesse <jesse [...] fsck.com>
> Well clearly there is some controversy now that I had not anticipated.
Indeed. I apologize if I was overly enthusiastic before I understood the destabilizing impact. Show quoted text
> date crept, nay galloped, up. I can only imagine how much worse it is > when you don't really know these volunteer participants.
No worries. My footnotes were intended as humor rather than as a cry of pain. (Though I really did dream about 5.12 release engineering last night.) Show quoted text
> I've come out of retirement to work on this glaring hole in Perl wrt > Unicode. It was bigger than I imagined. And I would like to see it > fixed, yesterday.
I'd love to see it fixed ten years ago. That said, I _am_ absolutely thrilled that it's getting fixed. I just want to make sure that we don't hurt lots of people as we do it. Show quoted text
> That said, it is more important to not destabilize existing code. I > don't know what the right thing is for 5.12.
To reiterate my current thinking:

* I like the control that Yves' per-regex modifiers give us.
* I don't want to break users' legacy Perl code.
* I really do like the correct semantics you've gotten working.

Yves' modifiers will give us fine-grained control, but won't eliminate the need for good defaults. It really does sound like the small tweak to the default legacy-ness value that Nicholas suggested would eliminate the contention about default behavior. It would mean that new code that declares it wants new semantics would get them, and code that says nothing would get the traditional behaviour.

How's that sound?

-Jesse
--
CC: Perl5 Porters <perl5-porters [...] perl.org>, Juerd Waalboer <juerd [...] convolution.nl>, yves.orton [...] booking.com, Glenn Linderman <perl [...] nevcal.com>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Wed, 09 Dec 2009 21:39:00 -0700
To: demerphq <demerphq [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
demerphq wrote: Show quoted text
> 2009/12/9 karl williamson <public@khwilliamson.com>:
>> I believe this resolves other bug reports, but haven't had time to look them >> up. >> >> The patch is both attached, and available at: >> git://github.com/khwilliamson/perl.git >> branch: matching >> >> This patch makes case-sensitive regex matching give the same results >> regardless of whether the string and/or pattern are in utf8, unless "use >> legacy 'unicode8bit'" is in effect, in which case it works as before. >> >> Since Yves is incommunicado,
> > I was heads-uped about this mail, but I've not had any time to respond > yet. Sorry. >
>> I took what he had done before Larry's veto and >> extended and modified it, adding an intermediate way. What that means is >> that anything that looks like[[:xxx:]] will match only in the ASCII range, >> or in the current locale, if set. I never heard any controversy about that >> part of the proposal, and it makes sense to me that a Posix construct should >> act like the Posix definition says to.
> > This is good IMO, it will allow us to close a number of open tickets. >
>> \d, \s, and \w (hence \b) and their complements act as before, except that >> when 8-bit unicode mode is on, they also match appropriately in the 128-255 >> range. >> >> This solves the utf8ness problem, as the Posix never match outside their >> locale or ascii, so utf8ness doesn't matter; and the others match the same >> whether utf8 or not. >> >> I was surprised at actually how little code was involved. Making Posix >> always mean Posix simplified things quite a bit. \d doesn't match anything >> in the 128-255 range, so it did not have to be touched. Essentially, all >> that had to be done was to create new regnodes for \s, \w, and \b (and >> complements) that say to match using unicode semantics. Everywhere their >> parallel nodes are in the code, I added these nodes. When compiling, >> regcomp checks for being in 8-bit unicode semantics mode, and if so, uses >> the new node; if not it uses the old node. In execution, regexec uses the >> old definition when matching the old node, and the new semantics when the >> match is for the new node. I split [[:word:]] from \w and [[:digit:]] from >> \d so that they would match using Posix semantics regardless of utf8ness. >> >> But that is basically it. >> >> Several .t files depended on the legacy behaviors to test edge cases for >> utf8ness. I added a 'use legacy' to those. >> >> Also, several text processing modules can't deal with \s matching a no-break >> space. I spent too much time trying to learn them to decide if this is a >> bug or not, finding the one or two lines in each that were at fault. It is >> a bug if the text can be utf8, which would automatically cause the \s to >> suddenly match the no-break space. But I wasn't sure which ones are claimed >> to transparently handle utf8. So, I added a 'use legacy' to the modules, >> which gives the same behavior as in the past. >> >> Several TODOs were accomplished and removed from some regex .t files >> >> I took advantage of changing regcomp.c to add a croak when the re has gone >> insane; I've had it in my development version for some time. It seems to >> happen when there are too many /\N{...}/ calls in a program. >>
> > I had a quick review of the patch and what you have done. > > I have two minor objections, but i dont think they need be seen as roadblocks. > > First, the problem of qr// raises its head. You construct a pattern > one context with your new pragma in effect, and then embed it in > another pattern somewhere else and the magicness of the pattern is > lost. This is the same problem as with use locale, and personally > something I think breaks the general modern model of patterns. However > it is better than nothing and modifiers can be leveraged on top of > your patch so that is fine IMO.
I'm not sure I follow this. I think what you're saying is that the original pattern is decompiled or thrown away and then recompiled under the new scheme? Show quoted text
> > Second, and really this is just another facet of the original problem > is that people now need to modify existing code to preserve the > existing semantics. If this was controlled by modifier then this > wouldnt be necessary as we would just make the default modifier behave > as in 5.8.x, also if really necessary we could bifurcate the POSIX > stuff into multiple opcodes (old/new behaviour) and resolve any > objections to fixing the POSIX opcodes.
One should be able to change the default modifier, I would hope. Show quoted text
> > However my opinion is this is a really good step forward and should be > applied to blead. > > cheers, > Yves > >
CC: Perl5 Porters <perl5-porters [...] perl.org>, Juerd Waalboer <juerd [...] convolution.nl>, yves.orton [...] booking.com, Glenn Linderman <perl [...] nevcal.com>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Thu, 10 Dec 2009 09:43:28 +0100
To: karl williamson <public [...] khwilliamson.com>
From: demerphq <demerphq [...] gmail.com>
2009/12/10 karl williamson <public@khwilliamson.com>: Show quoted text
> demerphq wrote:
>> I have two minor objections, but i dont think they need be seen as >> roadblocks. >> >> First, the problem of qr// raises its head. You construct a pattern >> one context with your new pragma in effect, and then embed it in >> another pattern somewhere else and the magicness of the pattern is >> lost. This is the same problem as with use locale, and personally >> something I think breaks the general modern model of patterns. However >> it is better than nothing and modifiers can be leveraged on top of >> your patch so that is fine IMO.
> > I'm not sure I follow this.  I think what you're saying is that the original > pattern is decompiled or thrown away and then recompiled under the new > scheme?
Yes. Essentially that is how embedding a qr// object into another pattern works. Basically it's like a C include. When you do:

    my $qr1= qr/this is a pattern/;
    my $qr2= qr/this is a pattern containing another pattern $qr1/;

the *source* of $qr1 is embedded in $qr2, not the opcodes. So any behaviour that is controlled by pragmata and not by in-regex modifiers will be lost. This is why the /msix modifiers have (?msix: ... ) forms.
>> >> Second, and really this is just another facet of the original problem >> is that people now need to modify existing code to preserve the >> existing semantics. If this was controlled by modifier then this >> wouldnt be necessary as we would just make the default modifier behave >> as in 5.8.x, also if really necessary we could bifurcate the POSIX >> stuff into multiple opcodes (old/new behaviour) and resolve any >> objections to fixing the POSIX opcodes.
> > One should be able to change the default modifier, I would hope.
Yes, I was thinking that a good plan would be to just define a generic interface for specifying default modifiers. The people who like to follow the PBP recommendation of always using /msx could then use the pragma.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
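The reason modifiers survive interpolation while pragma state does not is visible in how a qr// object stringifies; the recorded flags travel with the source text (the exact stringified form varies by perl version):

    my $inner = qr/cat \s+ dog/xi;    # /x and /i are recorded in the object
    print "$inner\n";                 # e.g. (?xi-sm:cat \s+ dog) on 5.10-era perls

    my $outer = qr/^pet: $inner$/;    # the inner flags still govern the inner part
    print "pet: CAT  DOG" =~ $outer ? "match\n" : "no match\n";   # match

(As a historical footnote for later readers: the generic default-modifier interface mooted here eventually arrived in core as the re pragma's '/flags' mode in perl 5.14, e.g. "use re '/xms';".)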
CC: karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>, Juerd Waalboer <juerd [...] convolution.nl>, yves.orton [...] booking.com, Glenn Linderman <perl [...] NevCal.com>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Thu, 10 Dec 2009 12:50:59 +0100
To: jesse <jesse [...] fsck.com>
From: Gerard Goossen <gerard [...] ggoossen.net>
On Wed, Dec 09, 2009 at 05:21:54PM -0500, jesse wrote: Show quoted text
> [...] > > That issue is fixable if we CPAN a version of "feature.pm" designed for > older releases of Perl, but it gets increasingly hard to keep "legacy" > directives synced between the differing realities of different Perls' > concepts of "legacy".
Publishing "legacy" shouldn't be too much of a problem. If we add for each directive, a version number in which it is introduced and optional a version number where the directive is stopped being supported, using this to decide what to do would be easy. Gerard Goossen
CC: karl williamson <public [...] khwilliamson.com>, demerphq <demerphq [...] gmail.com>, Perl5 Porters <perl5-porters [...] perl.org>, Juerd Waalboer <juerd [...] convolution.nl>, Nicholas Clark <nick [...] ccl4.org>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Thu, 10 Dec 2009 12:52:45 +0100
To: jesse <jesse [...] fsck.com>
From: Gerard Goossen <gerard [...] ggoossen.net>
What I am missing in the discussion is that, on average, existing code would be improved by changing the semantics, and thus instead of thinking about possibly breaking 20% of CPAN we are fixing 80% of CPAN.

If we want this to be the default at any time in the future, we should do it now, because I don't see how having another release cycle would change anything.

More specifically, about the failures caused by the changes:

The pod stuff is breaking because it expects a non-breaking space to be matched by \s; as far as I know it is about the only module expecting this behaviour (which is probably broken because it currently depends on the utf8-ness of the scalar). I did a similar change in Perl Kurila and what I remember is that only the pod module had problems with it. I'll check whether I can find the changes to the pod module which make it work without using the "use legacy 'unicode8bit'".

I am surprised at the failure of Test::Harness; if anything I would expect the change to fix it. Looking at ...\YAMLish\Reader.pm, it uses \s to match space characters, but according to YAML a non-breaking space isn't a space (and thus it would be part of the 80% of CPAN which would be fixed by the change). Karl: could you find out why it fails? I suspect that there is something having some (unwanted) side effect (which probably isn't wrong or shouldn't have any effect on code, but might be easily prevented).

Another class of failures is those that depend on the current behaviour to test the internals, like the POSIX/t/time.t test, which uses the current behaviour to test that the utf8 flag is not set; this is simply broken, and it should simply use utf8::is_utf8.

Gerard Goossen
CC: Perl5 Porters <perl5-porters [...] perl.org>, yves.orton [...] booking.com, Glenn Linderman <perl [...] NevCal.com>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Wed, 9 Dec 2009 20:23:27 +0100
To: karl williamson <public [...] khwilliamson.com>
From: Juerd Waalboer <juerd [...] convolution.nl>
karl williamson skribis 2009-12-09 12:11 (-0700): Show quoted text
> Since Yves is incommunicado, I took what he had done before Larry's veto > and extended and modified it, adding an intermediate way. What that > means is that anything that looks like[[:xxx:]] will match only in the > ASCII range, or in the current locale, if set. I never heard any > controversy about that part of the proposal, and it makes sense to me > that a Posix construct should act like the Posix definition says to.
These "posix" constructs have for a long time been documented as *equivalent* to \d, \s and \w, with two remarks: [[:space:]] also includes \cK and [[:word:]] doesn't even exist in POSIX. Changing them is as bad as changing the metacharacters. Changing them to break the equivalency might even be worse. Also, note that perlre calls this "POSIX character class **syntax**" (emphasis mine). An even stronger argument is that perlre defines equivalence with \p{...}, and explicitly mentions that these are Unicode constructs. -- Met vriendelijke groet, Kind regards, Korajn salutojn, Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig> Convolution: ICT solutions and consultancy <sales@convolution.nl>
CC: jesse <jesse [...] fsck.com>, karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>, Juerd Waalboer <juerd [...] convolution.nl>, Nicholas Clark <nick [...] ccl4.org>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Thu, 10 Dec 2009 13:16:04 +0100
To: Gerard Goossen <gerard [...] ggoossen.net>
From: demerphq <demerphq [...] gmail.com>
2009/12/10 Gerard Goossen <gerard@ggoossen.net>: Show quoted text
> What I am missing in the dicussion is that on average exists code > would be improved by chaning the semantics, and thus instead of > thinking about possibly breaking 20% of CPAN we are fixing 80% of > CPAN. >
Yes, I agree. And those crying "yeah, but it was documented to behave like X, so changing it is bad" have to accept that it DOESN'T work like "X" and CAN'T work like "X" without being buggy. Also, in many cases relating to this subject the docs are just wrong, or misleading, or whatever.

However, it should be remembered that going from the non-unicode world to the unicode one involves breaking a lot of previous system invariants (aka laws of the universe). For instance, in ASCII you can always assume that you can put a case-modified version of a string in the same storage as its original; this is most definitely NOT safe in unicode and will/can/could result in buffer overruns as a consequence. That's just a start.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
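Two stock examples of that broken invariant (a case-changed string no longer fits in the original's storage), shown at the character level; the first needs the Unicode casing semantics this thread is about, while the second is above 255 and so behaves this way on any perl:

    use feature 'unicode_strings';   # 5.12+; gives uc() Unicode rules for 0-255

    for my $c ("\x{DF}", "\x{FB03}") {        # sharp s, ffi ligature
        my $u = uc $c;
        printf "U+%04X: 1 character uppercases to %d characters (%s)\n",
            ord $c, length $u, $u;
    }
    # sharp s -> "SS" (1 -> 2), ffi -> "FFI" (1 -> 3)

At the byte level the growth can be worse still, which is the 3:1 worst case discussed earlier in this ticket.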
CC: karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>, yves.orton [...] booking.com, Glenn Linderman <perl [...] nevcal.com>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Thu, 10 Dec 2009 13:23:51 +0100
To: Juerd Waalboer <juerd [...] convolution.nl>
From: demerphq <demerphq [...] gmail.com>
2009/12/9 Juerd Waalboer <juerd@convolution.nl>: Show quoted text
> karl williamson skribis 2009-12-09 12:11 (-0700):
>> Since Yves is incommunicado, I took what he had done before Larry's veto >> and extended and modified it, adding an intermediate way.  What that >> means is that anything that looks like[[:xxx:]] will match only in the >> ASCII range, or in the current locale, if set.  I never heard any >> controversy about that part of the proposal, and it makes sense to me >> that a Posix construct should act like the Posix definition says to.
> > These "posix" constructs have for a long time been documented as > *equivalent* to \d, \s and \w, with two remarks: [[:space:]] also > includes \cK and [[:word:]] doesn't even exist in POSIX.
*mis*documented. And, [[:word:]] is spelled [[:alnum:]]. Show quoted text
> > Changing them is as bad as changing the metacharacters. Changing them to > break the equivalency might even be worse.
I very, very, very much doubt it, and consider this to be essentially FUD, especially as it fixes a stack of bugs related to their current behaviour. You cannot have both the current behaviour and a non-buggy implementation.

Simply put, I consider [^STUFF] matching the same code points as [STUFF] to be an irrefutable and overwhelming reason why the current behavior of the POSIX charclasses cannot be preserved. Essentially, for me this bug ends ANY debate on this particular issue. Had we known of this violation of the rules, we NEVER would have allowed it to escape into the wild.
> Also, note that perlre calls this "POSIX character class **syntax**" > (emphasis mine). > > An even stronger argument is that perlre defines equivalence with > \p{...}, and explicitly mentions that these are Unicode constructs.
*mis*documented as equivalent. At least one of the "equivalencies" was *never* true, and the other equivalencies were by breaking unicode rules to be more perl like. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>, Glenn Linderman <perl [...] nevcal.com>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Thu, 10 Dec 2009 14:11:57 +0100
To: Juerd Waalboer <juerd [...] convolution.nl>
From: demerphq <demerphq [...] gmail.com>
2009/12/10 Juerd Waalboer <juerd@convolution.nl>: Show quoted text
> demerphq skribis 2009-12-10 13:23 (+0100):
>> And, [[:word:]] is spelled [[:alnum:]].
> > juerd@lanova:~$ perl -le'print "foo" =~ /[[:word:]]/' > 1 > > See perlre
See regexec.c and regcomp.c for the source of our mutual confusion. Show quoted text
>> You cannot have both the current behaviour and non buggy implementation.
> > Fully agreed. That's certainly not what I'm after, either. >
>> Simply put I consider that: >> [^STUFF] matching the same code points as [STUFF] to be an irrefutable >> and overwhelming reason why the current behavior of POSIX charclass >> cannot be preserved.
> > What exactly do you mean by "current behaviour"? > > To fix the issue that codepoints 128..255 are included depending on > internal encoding, there are two options: > > - Ignore anything above 127 > - Provide full unicode semantics. > > The first, ASCII-only, would be a mistake.
No, it wouldn't. There are no "unicode semantics" for POSIX. It is a fundamental error to speak of there being any.
> Perhaps there is other current behaviour that I am not aware of.
Apparently my hint wasn't strong enough. Try matching all the legal code points against [^POSIX] and against [POSIX], and note all the cases where you have both matching. Then do it with the strings in unicode. Note all the errors. These are fundamental errors.

For me this debate is over: POSIX charclasses are not Unicode charclasses, and any contortion to try to make them so is futile and doomed to screw stuff over.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
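A sketch of the experiment being proposed, restricted to the Latin-1 range: for each code point it asks whether [[:alpha:]] gives the same answer for the single-byte and the UTF-8 representation of the very same character. Wherever the answers differ, that code point effectively matches both the class and its complement, depending only on storage, which is the inconsistency being called fundamental here. (What it prints depends on the semantics in effect; every line it does print is a representation-dependent answer.)

    for my $cp (0x80 .. 0xFF) {
        my $bytes = chr $cp;            # single-byte (native) representation
        my $utf8  = chr $cp;
        utf8::upgrade($utf8);           # same code point, UTF-8 representation

        my ($b, $u) = map { /[[:alpha:]]/ ? 'in' : 'out' } $bytes, $utf8;
        printf "U+%04X: bytes=%-3s utf8=%-3s  <-- representation-dependent\n",
            $cp, $b, $u
                if $b ne $u;
    }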
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Thu, 10 Dec 2009 13:10:23 +0000
To: Perl5 Porters <perl5-porters [...] perl.org>
From: Zefram <zefram [...] fysh.org>
demerphq wrote: Show quoted text
>And, [[:word:]] is spelled [[:alnum:]].
No, they're different. /\w/ and /[[:word:]]/ match "_", whereas /[[:alnum:]]/ does not. -zefram
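Quick verification of that, true on any recent perl:

    printf "%d %d %d\n",
        ("_" =~ /\w/          ? 1 : 0),   # 1
        ("_" =~ /[[:word:]]/  ? 1 : 0),   # 1
        ("_" =~ /[[:alnum:]]/ ? 1 : 0);   # 0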
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Thu, 10 Dec 2009 14:23:28 +0100
To: Zefram <zefram [...] fysh.org>
From: demerphq <demerphq [...] gmail.com>
2009/12/10 Zefram <zefram@fysh.org>: Show quoted text
> demerphq wrote:
>>And, [[:word:]] is spelled [[:alnum:]].
> > No, they're different.  /\w/ and /[[:word:]]/ match "_", whereas > /[[:alnum:]]/ does not.
Yes, mea culpa. They are confusingly named internally: [[:word:]] is internally called ALNUM, and [[:alnum:]] is helpfully called ALNUMC. Sigh.

yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Thu, 10 Dec 2009 18:57:21 +0000
To: Juerd Waalboer <juerd [...] convolution.nl>
From: John <john.imrie [...] vodafoneemail.co.uk>
Juerd Waalboer wrote: Show quoted text
> karl williamson skribis 2009-12-09 12:11 (-0700): >
>> Since Yves is incommunicado, I took what he had done before Larry's veto >> and extended and modified it, adding an intermediate way. What that >> means is that anything that looks like[[:xxx:]] will match only in the >> ASCII range, or in the current locale, if set. I never heard any >> controversy about that part of the proposal, and it makes sense to me >> that a Posix construct should act like the Posix definition says to. >>
> > These "posix" constructs have for a long time been documented as > *equivalent* to \d, \s and \w, with two remarks: [[:space:]] also > includes \cK and [[:word:]] doesn't even exist in POSIX. > > Changing them is as bad as changing the metacharacters. Changing them to > break the equivalency might even be worse. > > Also, note that perlre calls this "POSIX character class **syntax**" > (emphasis mine). > > An even stronger argument is that perlre defines equivalence with > \p{...}, and explicitly mentions that these are Unicode constructs. >
Could we then make [:Unicode property name:] map to \p{Unicode property name} and not just limit ourselves to the names list? Show quoted text
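Worth noting as an existing capability rather than a counter-proposal: \p{...} already works inside a bracketed character class, which gives much of what a [:Unicode property:] spelling would, just with different syntax:

    # GREEK SMALL LETTER PI matches a class built from a Unicode property
    # plus an ordinary range:
    print "\x{3C0}" =~ /[\p{Greek}0-9]/ ? "match\n" : "no match\n";   # match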
CC: karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>, yves.orton [...] booking.com, Glenn Linderman <perl [...] nevcal.com>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Thu, 10 Dec 2009 14:00:56 +0100
To: demerphq <demerphq [...] gmail.com>
From: Juerd Waalboer <juerd [...] convolution.nl>
demerphq skribis 2009-12-10 13:23 (+0100): Show quoted text
> And, [[:word:]] is spelled [[:alnum:]].
juerd@lanova:~$ perl -le'print "foo" =~ /[[:word:]]/'
1

See perlre
> You cannot have both the current behaviour and non buggy implementation.
Fully agreed. That's certainly not what I'm after, either. Show quoted text
> Simply put I consider that: > [^STUFF] matching the same code points as [STUFF] to be an irrefutable > and overwhelming reason why the current behavior of POSIX charclass > cannot be preserved.
What exactly do you mean by "current behaviour"?

To fix the issue that codepoints 128..255 are included depending on internal encoding, there are two options:

- Ignore anything above 127
- Provide full unicode semantics.

The first, ASCII-only, would be a mistake. Perhaps there is other current behaviour that I am not aware of.

--
Met vriendelijke groet, Kind regards, Korajn salutojn,
Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales@convolution.nl>
CC: demerphq <demerphq [...] gmail.com>, Perl5 Porters <perl5-porters [...] perl.org>, Glenn Linderman <perl [...] nevcal.com>
Subject: Re: POSIX-like syntax or full compliancy? (Was: PATCH: partial [perl #58182] ...)
Date: Fri, 11 Dec 2009 12:15:13 +0100
To: karl williamson <public [...] khwilliamson.com>
From: Juerd Waalboer <juerd [...] convolution.nl>
karl williamson skribis 2009-12-10 22:29 (-0700): Show quoted text
> It was our intention that 5.12 would use strict Posix definitions > rigourously for all these,
In that case I'm entirely fine with the change, provided of course that perldelta documents the change as such. Show quoted text
> except the perl made-up extension, [[:Word:]], which has no Posix > definition.
Changing the bracket expressions to strict POSIX semantics is an incompatible change. Why keep [:word:]? Not that I really mind, but strict interpretation usually doesn't come with exceptions. -- Met vriendelijke groet, Kind regards, Korajn salutojn, Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig> Convolution: ICT solutions and consultancy <sales@convolution.nl>
CC: jesse <jesse [...] fsck.com>, demerphq <demerphq [...] gmail.com>, Perl5 Porters <perl5-porters [...] perl.org>, Juerd Waalboer <juerd [...] convolution.nl>, Nicholas Clark <nick [...] ccl4.org>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Fri, 11 Dec 2009 12:20:06 -0700
To: Gerard Goossen <gerard [...] ggoossen.net>
From: karl williamson <public [...] khwilliamson.com>
Gerard Goossen wrote: Show quoted text
> What I am missing in the dicussion is that on average exists code > would be improved by chaning the semantics, and thus instead of > thinking about possibly breaking 20% of CPAN we are fixing 80% of > CPAN. > > If we want this to be the default at any time in the future, we should > do it now, because I don't see how having another release cycle would > change anything.
I'm thinking that if we make it not the default now, it would give people a chance to switch to it if they want, and a chance for module authors to check their code. If they don't, well, they did have a chance, as opposed to us springing it on them with no time to react.
> > More specific about the failures caused by the changes: > > The pod stuff is breaking because it expect a non-breakable-space to > be matched by \s, as far as I know it is about the only module > expecting this behaviour (which is probably broken because it > currently depends on the utf8-ness of the scalar). I did a similar > change in Perl Kurila and what I remember is that only the pod module > had problems with it. I'll check whether I can find the changes to the > pod module, which make them work without using the "use legacy > 'unicode8bit'".
How much of CPAN did you actually try on Kurila? I actually did find the lines that needed changing in all the modules except Test::Harness. They were in the wrap functions, and in some cases, another one as well. I was starting to fix them there, but realized I didn't know enough about what their input character set domain was supposed to be. Show quoted text
> > I am suprised at the failure of Test::Harness, if anything I would > expect it to fix it, looking at ...\YAMList\Reader.pm it uses \s to > match space characters, but according to YAML a non-breaking-space > isn't a space (and thus it would be part of 80% of CPAN which > would be fixed by the change). > > Karl: could you find out why it fails? I suspect that there is > something having some (unwanted) side effect (which probably isn't > wrong or shouldn't have any effect on code, but might be easily > prevented).
I actually don't feel I have the time to spend on this. The test that failed talked about Unprintables, and the failure was with the no break space. Show quoted text
> > Another class of failures are those that depend on the current > behaviour to test the internals, like the POSIX/t/time.t test, which > uses the current behaviour to test that utf8-flag is not set, this is > simply broken, and it should simply use utf8::is_utf8. > > Gerard Goossen >
CC: jesse <jesse [...] fsck.com>, demerphq <demerphq [...] gmail.com>, Perl5 Porters <perl5-porters [...] perl.org>, Juerd Waalboer <juerd [...] convolution.nl>, Nicholas Clark <nick [...] ccl4.org>
Subject: Re: PATCH: partial [perl #58182]: regex case-sensitive matching now utf8ness independent
Date: Sun, 13 Dec 2009 11:56:45 -0700
To: Gerard Goossen <gerard [...] ggoossen.net>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 2.9k
karl williamson wrote: Show quoted text
> Gerard Goossen wrote:
>> What I am missing in the dicussion is that on average exists code >> would be improved by chaning the semantics, and thus instead of >> thinking about possibly breaking 20% of CPAN we are fixing 80% of >> CPAN. >> >> If we want this to be the default at any time in the future, we should >> do it now, because I don't see how having another release cycle would >> change anything.
> > I'm thinking that if we make it not the default now, that it would give > people a chance to switch to it if they want; and a chance for module > authors to check their code. If they don't, well, they did have a > chance, as opposed to us springing it on them with no time for reaction.
>> >> More specific about the failures caused by the changes: >> >> The pod stuff is breaking because it expect a non-breakable-space to >> be matched by \s, as far as I know it is about the only module >> expecting this behaviour (which is probably broken because it >> currently depends on the utf8-ness of the scalar). I did a similar >> change in Perl Kurila and what I remember is that only the pod module >> had problems with it. I'll check whether I can find the changes to the >> pod module, which make them work without using the "use legacy >> 'unicode8bit'".
> > How much of CPAN did you actually try on Kurila? > > I actually did find the lines that needed changing in all the modules > except Test::Harness. They were in the wrap functions, and in some > cases, another one as well. I was starting to fix them there, but > realized I didn't know enough about what their input character set > domain was supposed to be.
>> >> I am suprised at the failure of Test::Harness, if anything I would >> expect it to fix it, looking at ...\YAMList\Reader.pm it uses \s to >> match space characters, but according to YAML a non-breaking-space >> isn't a space (and thus it would be part of 80% of CPAN which >> would be fixed by the change). >> >> Karl: could you find out why it fails? I suspect that there is >> something having some (unwanted) side effect (which probably isn't >> wrong or shouldn't have any effect on code, but might be easily >> prevented).
> > I actually don't feel I have the time to spend on this. The test that > failed talked about Unprintables, and the failure was with the no break > space.
>>
I had some more insight about this. I believe it is a bug in the test. I changed the order so that the no-break space wasn't first on the line, and it passed. There is probably an s/^\s+// line in the module, and it is reasonable for that to strip off a leading no-break space. But the test assumes that it shouldn't. Show quoted text
>> Another class of failures are those that depend on the current >> behaviour to test the internals, like the POSIX/t/time.t test, which >> uses the current behaviour to test that utf8-flag is not set, this is >> simply broken, and it should simply use utf8::is_utf8. >> >> Gerard Goossen >>
>
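As an aside, a minimal sketch (mine, not from the thread) of the effect described above: wherever \s has Unicode semantics -- shown here by upgrading the string to UTF-8 storage, where perl already applies them -- a leading no-break space counts as whitespace, so an s/^\s+// cleanup strips it along with the real spaces:

my $line = "\xA0  value";
utf8::upgrade($line);    # UTF-8 storage: \s already has Unicode semantics here
$line =~ s/^\s+//;       # the no-break space goes too
print "[$line]\n";       # prints "[value]"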
CC: Juerd Waalboer <juerd [...] convolution.nl>, demerphq <demerphq [...] gmail.com>, Glenn Linderman <perl [...] NevCal.com>, Michael G Schwern <schwern [...] pobox.com>
Subject: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 11 May 2010 12:51:22 -0600
To: Perl5 Porters <perl5-porters [...] perl.org>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 1.7k
make regen is required. Also, this patches a .t in Test::Simple; I'm cc'ing the CPAN maintainer.

The attached series of commits fixes the inconsistent handling of Latin1 characters in matching \s, \w, and hence \b (boundary matching) and their complements. This solves the second of the 5 areas of the "Unicode Bug". (The first, lc(), ucfirst(), ..., was fixed for 5.12. Those remaining are matching POSIX character classes, matching /i, and user-defined case mappings.)

These commits also add the regex modifiers /u (unicode), /l (locale), and /t (traditional). /a is not part of this patch. I have made up the term "matching mode" to describe this. I'm open to a better term, if you can think of one.

Much of this patch was submitted and withdrawn last year. It has a somewhat cleaner implementation than that one, in that no new regnodes were added. Instead, it turns out that the flags field in the affected regnodes was unused. By using that, we fly under the radar of some other code, which as a result didn't have to change.

Note that there is a behavior change that may be incompatible with existing code. Previously, if a regex was compiled within 'use locale' and then interpolated into another regex outside it, the locale-ness of the interpolated part was lost, and vice versa. This patch causes the regex to remember how it was compiled, so it stays with it even when interpolated. Also, the stringification of a regex will show its matching-mode modifier, e.g., 't', so code that looks at that will have to change. Several of the .t changes are because of this, and because the minimum length of the stringification changed: for example, (?t-xism:...) with this patch, instead of (?-xism:...) before it.

I'm working on the pod changes, and will submit them later.
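For readers following along, a minimal sketch (mine, not part of the patch series) of the utf8-ness dependence that the \s/\w commits remove; the \x{100} is appended only to force the second string into UTF-8 storage:

print "\xE0"        =~ /^\w/ ? 1 : 0, "\n";   # 0 before this change: byte-stored string, \w misses a-grave
print "\xE0\x{100}" =~ /^\w/ ? 1 : 0, "\n";   # 1: the same first character matches once the string happens to be UTF-8

With these commits, asking for Unicode semantics (the /u modifier described above) gives 1 in both cases, while /t ('traditional') keeps the old behavior for code that relies on it.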
CC: Juerd Waalboer <juerd [...] convolution.nl>, demerphq <demerphq [...] gmail.com>, Glenn Linderman <perl [...] NevCal.com>, Michael G Schwern <schwern [...] pobox.com>
Subject: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 11 May 2010 12:54:01 -0600
To: Perl5 Porters <perl5-porters [...] perl.org>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 1.8k
Oops. All that work, and then I forgot to attach the patches. Now doing so. make regen is required. Also, this patches a .t in Test::Simple; I'm cc'ing the cpan maintainer. The attached series of commits fix the inconsistent handling of Latin1 characters in matching \s, \w, and hence \b (boundary matching) and their complements. This solves the second of the 5 areas of the "Unicode Bug". (The first, lc(), ucfirst(), ... was fixed for 5.12. Those remaining are matching POSIX character classes, matching /i, and user-defined case mappings.) These commits also add regex modifiers /u (unicode), /l (locale), and /t (traditional). /a is not part of this patch. I have made up the term "Matching mode" to describe this. I'm open to a better term, if you can think of one. Much of this patch was submitted and withdrawn last year. It has a somewhat cleaner implementation than that one, in that no new regnodes were added. Instead, it turns out that the flags field in the affected regnodes was unused. By using that, we fly under the radar of some other code, which as a result didn't have to change. Note that there is a behavior change that may be incompatible with existing code. Previously, if a regex is compiled from within 'use locale', and then interpolated into another regex outside it, the localeness of the interpolated part is lost. And vice versa. This patch causes the regex to remember how it was compiled, so it stays with it even when interpolated. Also, the stringification of a regex will show its matching mode modifier, e.g., 't', so code that looks at that will have to change. Several of the .t changes are because of this, and because the minimum length of this changed. For example, (?t-xism:...) with this patch, instead of (?-xism:...) before it. I'm working on the pod changes, and will submit them later.
Download 0001-Typo.patch
text/plain 640b
From a297b4fe606e09a5e218a7af32fb23517ccf8ca6 Mon Sep 17 00:00:00 2001 From: Karl Williamson <khw@khw-desktop.(none)> Date: Thu, 14 Jan 2010 16:01:13 -0700 Subject: [PATCH] Typo --- universal.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/universal.c b/universal.c index 5a2cddb..e4fe9be 100644 --- a/universal.c +++ b/universal.c @@ -1289,7 +1289,7 @@ XS(XS_re_regexp_pattern) if ((re = SvRX(ST(0)))) /* assign deliberate */ { - /* Housten, we have a regex! */ + /* Houston, we have a regex! */ SV *pattern; STRLEN left = 0; char reflags[6]; -- 1.5.6.3
From 0fc54becc48ebfe38c45d0b4eb2a41529074ec41 Mon Sep 17 00:00:00 2001 From: Karl Williamson <khw@khw-desktop.(none)> Date: Thu, 14 Jan 2010 16:02:14 -0700 Subject: [PATCH] Use sizeof instead of hard-coded array size The array should be declared with its actual size. --- universal.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/universal.c b/universal.c index e4fe9be..97d2f18 100644 --- a/universal.c +++ b/universal.c @@ -1292,7 +1292,7 @@ XS(XS_re_regexp_pattern) /* Houston, we have a regex! */ SV *pattern; STRLEN left = 0; - char reflags[6]; + char reflags[sizeof(INT_PAT_MODS)]; if ( GIMME_V == G_ARRAY ) { /* -- 1.5.6.3
From 9991fad598f21e55e22d23af7a396290a5c5d0cd Mon Sep 17 00:00:00 2001 From: Karl Williamson <khw@khw-desktop.(none)> Date: Thu, 14 Jan 2010 17:36:46 -0700 Subject: [PATCH] Add tested for corrupted regnode --- regcomp.c | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/regcomp.c b/regcomp.c index 337f0c4..84c7dc1 100644 --- a/regcomp.c +++ b/regcomp.c @@ -9841,6 +9841,10 @@ Perl_regnext(pTHX_ register regnode *p) if (!p) return(NULL); + if (OP(p) > REGNODE_MAX) { /* regnode.type is unsigned */ + Perl_croak(aTHX_ "Corrupted regexp opcode %d > %d", (int)OP(p), (int)REGNODE_MAX); + } + offset = (reg_off_by_arg[OP(p)] ? ARG(p) : NEXT_OFF(p)); if (offset == 0) return(NULL); -- 1.5.6.3
From a04b6010ab24747974feff2214d22f2c7a08d23a Mon Sep 17 00:00:00 2001 From: Karl Williamson <khw@khw-desktop.(none)> Date: Thu, 14 Jan 2010 19:19:22 -0700 Subject: [PATCH] Display characters as Unicode for clarity --- t/re/pat_special_cc.t | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/t/re/pat_special_cc.t b/t/re/pat_special_cc.t index 1138cbb..36116b8 100644 --- a/t/re/pat_special_cc.t +++ b/t/re/pat_special_cc.t @@ -37,6 +37,7 @@ sub run_tests { my @plain_complement_failed; for my $ord (0 .. $upper_bound) { my $ch= chr $ord; + my $ord = sprintf "U+%04X", $ord; # For display in Unicode terms my $plain= $ch=~/$special/ ? 1 : 0; my $plain_u= $ch=~/$upper/ ? 1 : 0; push @plain_complement_failed, "$ord-$plain-$plain_u" if $plain == $plain_u; -- 1.5.6.3
From 8d24edd55e96df3ce466ce4888910038db233b4d Mon Sep 17 00:00:00 2001 From: Karl Williamson <khw@khw-desktop.(none)> Date: Thu, 14 Jan 2010 21:21:37 -0700 Subject: [PATCH] Clarify that count is bytes not unicode characters --- regexec.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/regexec.c b/regexec.c index 17a0dc6..0f67a65 100644 --- a/regexec.c +++ b/regexec.c @@ -2194,7 +2194,7 @@ Perl_regexec_flags(pTHX_ REGEXP * const rx, char *stringarg, register char *stre RE_PV_QUOTED_DECL(quoted,do_utf8,PERL_DEBUG_PAD_ZERO(1), s,strend-s,60); PerlIO_printf(Perl_debug_log, - "Matching stclass %.*s against %s (%d chars)\n", + "Matching stclass %.*s against %s (%d bytes)\n", (int)SvCUR(prop), SvPVX_const(prop), quoted, (int)(strend - s)); } -- 1.5.6.3

Message body is not shown because it is too large.

Message body is not shown because it is too large.

From b876352daa427fe4a657d39589e9840f9f15bb75 Mon Sep 17 00:00:00 2001 From: Karl Williamson <khw@khw-desktop.(none)> Date: Tue, 11 May 2010 12:47:57 -0600 Subject: [PATCH] Change Test-Simple .t fix --- cpan/Test-Simple/t/fail-like.t | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/cpan/Test-Simple/t/fail-like.t b/cpan/Test-Simple/t/fail-like.t index 0ea5fab..c8162ef 100644 --- a/cpan/Test-Simple/t/fail-like.t +++ b/cpan/Test-Simple/t/fail-like.t @@ -48,7 +48,7 @@ OUT # Failed test 'is foo like that' # at .* line 1\. # 'foo' -# doesn't match '\\(\\?-xism:that\\)' +# doesn't match '\\(\\?t-xism:that\\)' ERR $TB->like($err->read, qr/^$err_re$/, 'failing errors'); -- 1.5.6.3
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 12 May 2010 10:48:01 +0100
To: Perl5 Porters <perl5-porters [...] perl.org>
From: Zefram <zefram [...] fysh.org>
Download (untitled) / with headers
text/plain 159b
karl williamson wrote: Show quoted text
>Note that there is a behavior change that may be incompatible with >existing code.
Does the parsing of "/foo/lt +1" change? -zefram
CC: Perl5 Porters <perl5-porters [...] perl.org>, Yves Orton <yves.orton [...] booking.com>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 12 May 2010 09:16:17 -0600
To: Zefram <zefram [...] fysh.org>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 934b
Zefram wrote: Show quoted text
> karl williamson wrote:
>> Note that there is a behavior change that may be incompatible with >> existing code.
> > Does the parsing of "/foo/lt +1" change? > > -zefram >
Yes.

When this was being discussed last year, and these modifiers were agreed on, no one mentioned that adding modifier letters could conflict with existing syntax. But someone thought of it later, as I remember seeing a .t file patch come through to make sure that things like '/foo/and bar' don't ever change in meaning.

In looking at this more, I see existing ambiguities in ge and cmp. I suppose those have always been there, and any code that ran into them would have uncovered the problem immediately.

It appears that 'l' can't be used, as it changes the meaning of '/foo/le +1'. And neither can 't', because of 'gt'.

'h' for historical could be used instead of 't' (I can't think of any conflicts with this), but what could be used to mean locale?
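To make the clash concrete, an illustration (mine, not from the thread); it runs as shown on the 5.12-era perls being discussed, and the 'would become' comments reflect the analysis elsewhere in this ticket:

$_ = "food";
my $cmp = /foo/le +1;    # parsed today as (m/foo/) le (+1), a string comparison
print $cmp, "\n";        # prints 1: the match succeeds and "1" le "1" is true
# With 'l' as a modifier, /foo/le would instead be a syntax error ('e' is not
# an m// modifier), while /foo/lt +1 would silently become (m/foo/lt) + 1.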
CC: jesse <jesse [...] fsck.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 18 May 2010 11:52:37 -0600
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 1.3k
karl williamson wrote: Show quoted text
> Zefram wrote:
>> karl williamson wrote:
>>> Note that there is a behavior change that may be incompatible with >>> existing code.
>> >> Does the parsing of "/foo/lt +1" change? >> >> -zefram >>
> > Yes. > > When this was being discussed last year, and these modifiers were agreed > on, no one mentioned adding modifier letters could conflict with > existing syntax. But someone thought of it later, as I remember seeing > a .t file patch come through to make sure that things like > '/foo/and bar' don't ever change in meaning. > > In looking at this more, I see existing ambiguities in ge and cmp. I > suppose those have always been there, and so no code ever got run > without uncovering the problem. > > It appears that 'l' can't be used as it changes the meaning of > '/foo/le +1'. > > And neither can 't' because of 'gt'. > > 'h' for historical could be used instead of 't' (I can't think of any > conflicts with this), but what could be used to mean locale? >
In thinking about modifier letters that can't possibly conflict with any existing constructs, I came up with the following: To mean the way it always has worked: 'h' for historical, or 'r' for oRiginal, retro or retarded. To mean locale: 'z' for zone. I asked Yves privately about this, and he wonders if it is worth trying to not break constructs like '/foo/lt +2'
CC: jesse <jesse [...] fsck.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 18 May 2010 18:20:15 +0000
To: karl williamson <public [...] khwilliamson.com>
From: Ævar Arnfjörð Bjarmason <avarab [...] gmail.com>
Download (untitled) / with headers
text/plain 645b
On Tue, May 18, 2010 at 17:52, karl williamson <public@khwilliamson.com> wrote: Show quoted text
> I asked Yves privately about this, and he wonders if it is worth trying to > not break constructs like '/foo/lt +2'
I don't think it's worth it. We should just pick the modifier letters that make sense and not bend over backwards to be backwards compatible with a *very* small amount of code out there. We have to weigh that against all the people that have to recall the name of these modifiers in the future. The /k => /p thing was unfortunate enough, but that was for different reasons. Did we even consider breaking /foo/p() back then? Not that I recall.
CC: karl williamson <public [...] khwilliamson.com>, jesse <jesse [...] fsck.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 18 May 2010 14:29:44 -0400
To: ?var Arnfj?r? Bjarmason <avarab [...] gmail.com>
From: Jesse Vincent <jesse [...] fsck.com>
Download (untitled) / with headers
text/plain 696b
On Tue, May 18, 2010 at 06:20:15PM +0000, ?var Arnfj?r? Bjarmason wrote: Show quoted text
> On Tue, May 18, 2010 at 17:52, karl williamson <public@khwilliamson.com> wrote:
> > I asked Yves privately about this, and he wonders if it is worth trying to > > not break constructs like '/foo/lt +2'
> > I don't think it's worth it. We should just pick the modifier letters > that make sense and not bend over backwards to be backwards compatible > with a *very* small amount of code out there.
I'd rather we design a solution that doesn't break backward compatibility, no matter how much code we _think_ it might break. If that turns out to be completely untenable, then we can talk about how we hurt our users.
CC: "?var Arnfj?r? Bjarmason" <avarab [...] gmail.com>, karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 18 May 2010 21:22:47 +0200
To: Jesse Vincent <jesse [...] fsck.com>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 1.8k
On 18 May 2010 20:29, Jesse Vincent <jesse@fsck.com> wrote: Show quoted text
> On Tue, May 18, 2010 at 06:20:15PM +0000, ?var Arnfj?r? Bjarmason wrote:
>> On Tue, May 18, 2010 at 17:52, karl williamson <public@khwilliamson.com> wrote:
>> > I asked Yves privately about this, and he wonders if it is worth trying to >> > not break constructs like '/foo/lt +2'
>> >> I don't think it's worth it. We should just pick the modifier letters >> that make sense and not bend over backwards to be backwards compatible >> with a *very* small amount of code out there.
> > I'd rather we design a solution that doesn't break backward > compatibility, no matter how much code we _think_ it might break. If > that turns out to be completely untenable, then we can talk about how we > hurt our users.
I have to say that while I tend to agree with you in general, on this one I think the boat has already sailed.

For instance, what characters can follow an s/// expression and an m// expression vary. So, on one hand we have stuff like:

m/..../ge +1

which is a syntax error, but

m/..../le +1

is a valid expression, yet

s/...//ge +1

is interpreted as (s/...//ge) (+1).

Given how muddy the waters already are, IMO this type of breakage is only of interest to golfers, and is likely to involve only a mere handful of scripts whose authors probably aren't going to upgrade anyway. Anyone sane puts a space after a regex and its modifiers anyway, IMO.

Anyway, if we really are going to care about this now, I really think we ought to introduce a deprecation warning when a character that is not a legal modifier is encountered immediately following a regex pattern (of any sort), so that we don't have to worry about it ever again.

And I will note that the introduction of /p was achieved without anyone reporting any problems like this.

Cheers, yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: "?var Arnfj?r? Bjarmason" <avarab [...] gmail.com>, karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 18 May 2010 21:25:39 +0200
To: Jesse Vincent <jesse [...] fsck.com>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 2.1k
On 18 May 2010 21:22, demerphq <demerphq@gmail.com> wrote: Show quoted text
> On 18 May 2010 20:29, Jesse Vincent <jesse@fsck.com> wrote:
>> On Tue, May 18, 2010 at 06:20:15PM +0000, ?var Arnfj?r? Bjarmason wrote:
>>> On Tue, May 18, 2010 at 17:52, karl williamson <public@khwilliamson.com> wrote:
>>> > I asked Yves privately about this, and he wonders if it is worth trying to >>> > not break constructs like '/foo/lt +2'
>>> >>> I don't think it's worth it. We should just pick the modifier letters >>> that make sense and not bend over backwards to be backwards compatible >>> with a *very* small amount of code out there.
>> >> I'd rather we design a solution that doesn't break backward >> compatibility, no matter how much code we _think_ it might break. If >> that turns out to be completely untenable, then we can talk about how we >> hurt our users.
> > I have to say that while I tend to agree with you in general on this > one I think the boat already sailed. > > For instance, what characters can follow an s/// expression and a m// > expression vary. > > So, on one hand we have stuff like: > > m/..../ge +1 > > which is a syntax error, but > > m/..../le +1 > > is a valid expression, yet > > s/...//ge +1 > > is interpreted as (s/...//ge) (+1). > > Given how muddy the waters already are IMO this type of breakage is > only of interest to golfers. And is likely to involve only a mere > handful of scripts who probably aren't going to upgrade anyway. Anyone > sane puts a space after a regex and its modifiers anyway IMO. > > Anyway, if we really are going to care about this now I really think > we aught to introduce deprecation warning when a character is > encountered immediately following a regex pattern (of any sort) that > is not a legal modifier so that we dont have to worry about it ever > again. > > And I will note that the introduction of /p was achieved without > anyone reporting any problems like this.
And in fact historically we DID have IMO a more worrying construct added, and I suspect it also was harmless. I speak specifically of the /x modifier. /foo/x 10 would have had a rather different meaning before and after the introduction of "commented pattern mode". Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: jesse <jesse [...] fsck.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 18 May 2010 15:27:49 -0400
To: karl williamson <public [...] khwilliamson.com>
From: Eric Brine <ikegami [...] adaelis.com>
Download (untitled) / with headers
text/plain 687b
On Tue, May 18, 2010 at 1:52 PM, karl williamson <public@khwilliamson.com>wrote: Show quoted text
> I asked Yves privately about this, and he wonders if it is worth trying to > not break constructs like '/foo/lt +2' >
(Nit: even if we use "l", breaking "lt" can be avoided by looking ahead for "t", since "t" is not a valid modifier. "le" is a different matter.)

Possible solutions that don't require using weird names:

- Have users use /foo/el or /(?l:foo)/e instead of /foo/le if they want locale semantics.
- These new modifiers will require Perl 5.14 to compile, so requiring a "use 5.014;" to use them would not add any restrictions and would sidestep the entire dilemma.
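A hypothetical sketch (mine; released perls do not gate the letters this way) of how the second suggestion would look in user code:

use 5.014;              # opt this lexical scope in to the 5.14 parse, per the proposal
my $re = qr/foo/l;      # 'l' for locale semantics, recognised only because of the line above
# Code without the 'use 5.014;' would keep today's parse, so nothing already
# written -- '/foo/le +1' included -- could change meaning.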
CC: Jesse Vincent <jesse [...] fsck.com>, ?var Arnfj?r? Bjarmason <avarab [...] gmail.com>, karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 18 May 2010 15:30:20 -0400
To: demerphq <demerphq [...] gmail.com>
From: Jesse Vincent <jesse [...] fsck.com>
Download (untitled) / with headers
text/plain 993b
Show quoted text
> > Given how muddy the waters already are IMO this type of breakage is > > only of interest to golfers. And is likely to involve only a mere > > handful of scripts who probably aren't going to upgrade anyway. Anyone > > sane puts a space after a regex and its modifiers anyway IMO.
That is not an argument I can accept. The "nobody sane would do this" strawman is a great way to get a kneejerk reaction out of me. Show quoted text
> > And I will note that the introduction of /p was achieved without > > anyone reporting any problems like this.
> > And in fact historically we DID have IMO a more worrying construct > added, and I suspect it also was harmless. I speak specifically of the > /x modifier.
"We did it wrong before and nobody freaked out" is the same sort of thing. Neither of those arguments suggest that this is a case when we should break backward compatibility gratuitously. We may well be unable to do this cleanly and sanely. But that's a very different sort of argument. -Jesse
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 18 May 2010 20:53:05 +0100
To: Perl5 Porters <perl5-porters [...] perl.org>
From: Zefram <zefram [...] fysh.org>
Download (untitled) / with headers
text/plain 501b
demerphq wrote: Show quoted text
>Anyway, if we really are going to care about this now I really think >we aught to introduce deprecation warning when a character is >encountered immediately following a regex pattern (of any sort) that >is not a legal modifier so that we dont have to worry about it ever >again.
We just had that discussion with respect to letters immediately following numeric literals. That (along with the dots after numerics) was what prompted Jesse's paean to backward compatibility. -zefram
CC: "?var Arnfj?r? Bjarmason" <avarab [...] gmail.com>, karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 18 May 2010 22:04:21 +0200
To: Jesse Vincent <jesse [...] fsck.com>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 647b
On 18 May 2010 21:30, Jesse Vincent <jesse@fsck.com> wrote: Show quoted text
> Neither of those arguments suggest that this is a case when we should > break backward compatibility gratuitously.
We shall have to agree to disagree that this is gratuitous breakage. I think that's going far too far. Show quoted text
> We may well be unable to do this cleanly and sanely. But that's a very > different sort of argument.
Basically we only have to worry about 'l' because of 'le', and 'f' because of 'if'. Any others? And actually it occurs to me that we can safely bypass the whole problem by using capitals :-) cheers, yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Jesse Vincent <jesse [...] fsck.com>, "?var Arnfj?r? Bjarmason" <avarab [...] gmail.com>, karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 18 May 2010 19:40:15 -0400
To: demerphq <demerphq [...] gmail.com>
From: Eric Brine <ikegami [...] adaelis.com>
Download (untitled) / with headers
text/plain 1.1k
On Tue, May 18, 2010 at 4:04 PM, demerphq <demerphq@gmail.com> wrote: Show quoted text
> On 18 May 2010 21:30, Jesse Vincent <jesse@fsck.com> wrote:
> > Neither of those arguments suggest that this is a case when we should > > break backward compatibility gratuitously.
> > We shall have to agree to disagree that this is gratuitous breakage. I > think thats going far too far. >
> > We may well be unable to do this cleanly and sanely. But that's a very > > different sort of argument.
> > Basically we only have to worry about 'l' because of 'le', and 'f' > because of 'if'. Any others? >
Not "if". it's already a syntax error because "i" is a valid option. Any of the following immediately following the delimiter are currently valid, but will become a syntax error (e.g. /foo/le+1) or different valid code (e.g. /foo/lt+1): - unless & until from /u - le & lt from /l - [none] from /t We're precluded from using these: - /a (and) - /f (for, foreach) - /n (ne) - /w (when, while) We don't have to worry about these: - cmp - eq - if - ge & gt - or - xor - builtin function - sub names - barewords
CC: "'Jesse Vincent'" <jesse [...] fsck.com>, "'?var Arnfj?r? Bjarmason'" <avarab [...] gmail.com>, "'karl williamson'" <public [...] khwilliamson.com>, "'Perl5 Porters'" <perl5-porters [...] perl.org>
Subject: RE: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 18 May 2010 17:03:55 -0700
To: "'Eric Brine'" <ikegami [...] adaelis.com>, "'demerphq'" <demerphq [...] gmail.com>
From: "Jan Dubois" <jand [...] activestate.com>
Download (untitled) / with headers
text/plain 1.3k
On Tue, 18 May 2010, Eric Brine wrote: Show quoted text
> On Tue, May 18, 2010 at 4:04 PM, demerphq <demerphq@gmail.com> wrote:
> > Basically we only have to worry about 'l' because of 'le', and 'f' > > because of 'if'. Any others? > >
> > Not "if". it's already a syntax error because "i" is a valid option. > > Any of the following immediately following the delimiter are currently > valid, but will become a syntax error (e.g. /foo/le+1) or different valid > code (e.g. /foo/lt+1): > > - unless & until from /u > - le & lt from /l > - [none] from /t > > We're precluded from using these: > > - /a (and) > - /f (for, foreach) > - /n (ne) > - /w (when, while) > > We don't have to worry about these: > > - cmp > - eq > - if > - ge & gt > - or > - xor > - builtin function > - sub names > - barewords
Yes, but why bother? What is wrong with your previous suggestion to only allow the new modifiers after a

use 5.014;

That lets us pick the letters based on mnemonic value instead of having to work around some obscure edge cases. And the code using the new letters will not work on earlier Perl versions anyway, so having the "use 5.014" in there is a good idea anyway.

This will also allow us to turn any currently unused modifiers into syntax errors right away for all 5.14+ code as well, without breaking any compatibility.

What's not to like?

Cheers,
-Jan
CC: "'Eric Brine'" <ikegami [...] adaelis.com>, "'demerphq'" <demerphq [...] gmail.com>, "'Jesse Vincent'" <jesse [...] fsck.com>, "'?var Arnfj?r? Bjarmason'" <avarab [...] gmail.com>, "'Perl5 Porters'" <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 18 May 2010 18:45:18 -0600
To: Jan Dubois <jand [...] activestate.com>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 1.4k
Jan Dubois wrote: Show quoted text
> On Tue, 18 May 2010, Eric Brine wrote:
>> On Tue, May 18, 2010 at 4:04 PM, demerphq <demerphq@gmail.com> wrote:
>>> Basically we only have to worry about 'l' because of 'le', and 'f' >>> because of 'if'. Any others? >>>
>> Not "if". it's already a syntax error because "i" is a valid option. >> >> Any of the following immediately following the delimiter are currently >> valid, but will become a syntax error (e.g. /foo/le+1) or different valid >> code (e.g. /foo/lt+1): >> >> - unless & until from /u >> - le & lt from /l >> - [none] from /t >> >> We're precluded from using these: >> >> - /a (and) >> - /f (for, foreach) >> - /n (ne) >> - /w (when, while) >> >> We don't have to worry about these: >> >> - cmp >> - eq >> - if >> - ge & gt >> - or >> - xor >> - builtin function >> - sub names >> - barewords
> > Yes, but why bother? What is wrong with your previous suggestion to only allow > the new modifiers after a > > use 5.014; > > That lets us pick the letters based on mnemonic value instead of having to > work around some obscure edge cases. And the code using the new letters > will not work on earlier Perl versions anyways, so having the "use 5.014" > in there is a good idea anyways. > > This will also allow to us to turn any currently unused modifiers into syntax > errors right away for all 5.14+ code as well without breaking any compatibility. > > What's not to like? > > Cheers, > -Jan >
+1
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 18 May 2010 20:53:56 -0400
To: 'Perl5 Porters' <perl5-porters [...] perl.org>
From: Jesse Vincent <jesse [...] fsck.com>
Download (untitled) / with headers
text/plain 726b
On Tue, May 18, 2010 at 06:45:18PM -0600, karl williamson wrote: Show quoted text
> >Yes, but why bother? What is wrong with your previous suggestion to only allow > >the new modifiers after a > > > > use 5.014; > > > >That lets us pick the letters based on mnemonic value instead of having to > >work around some obscure edge cases. And the code using the new letters > >will not work on earlier Perl versions anyways, so having the "use 5.014" > >in there is a good idea anyways. > > > >This will also allow to us to turn any currently unused modifiers into syntax > >errors right away for all 5.14+ code as well without breaking any compatibility. > > > >What's not to like? > > > >Cheers, > >-Jan > >
> +1
Indeed. I'm a fan. --
CC: 'Eric Brine' <ikegami [...] adaelis.com>, 'demerphq' <demerphq [...] gmail.com>, 'Jesse Vincent' <jesse [...] fsck.com>, '?var Arnfj?r? Bjarmason' <avarab [...] gmail.com>, 'karl williamson' <public [...] khwilliamson.com>, 'Perl5 Porters' <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 09:56:59 +0200
To: Jan Dubois <jand [...] activestate.com>
From: Steffen Mueller <smueller [...] cpan.org>
Hi all, Jan Dubois wrote: Show quoted text
> On Tue, 18 May 2010, Eric Brine wrote:
>> On Tue, May 18, 2010 at 4:04 PM, demerphq <demerphq@gmail.com> wrote:
>>> Basically we only have to worry about 'l' because of 'le', and 'f' >>> because of 'if'. Any others?
[...] Show quoted text
>> Any of the following immediately following the delimiter are currently >> valid, but will become a syntax error (e.g. /foo/le+1) or different valid >> code (e.g. /foo/lt+1): >> >> - unless & until from /u >> - le & lt from /l >> - [none] from /t >> >> We're precluded from using these: >> >> - /a (and) >> - /f (for, foreach) >> - /n (ne) >> - /w (when, while)
I think these are MUCH more likely to be a problem than the three above. Show quoted text
>> We don't have to worry about these:
[...] Show quoted text
> Yes, but why bother? What is wrong with your previous suggestion to only allow > the new modifiers after a > > use 5.014; > > That lets us pick the letters based on mnemonic value instead of having to > work around some obscure edge cases. And the code using the new letters > will not work on earlier Perl versions anyways, so having the "use 5.014" > in there is a good idea anyways.
I couldn't agree more with this last paragraph. Show quoted text
> This will also allow to us to turn any currently unused modifiers into syntax > errors right away for all 5.14+ code as well without breaking any compatibility. > > What's not to like?
Edge cases and action at a distance. Edge cases internally (with any such decision, we start maintaining two branches of behaviour in the same code base). Action at a distance in user code.

Of course, use VERSION is lexical, so this isn't action across 1M lines of code, but it may well be across a couple of thousand in a badly written application or module.

Don't get me wrong. I'd rather move forward and make behaviour conditional on the use VERSION at the top of the file than not move on at all. But I believe that this is a case of being overzealous regarding backwards compatibility.

So I guess overall, this is a +1 to any solution, with use VERSION or without.

--Steffen
CC: Jan Dubois <jand [...] activestate.com>, "'Eric Brine'" <ikegami [...] adaelis.com>, "'demerphq'" <demerphq [...] gmail.com>, "'Jesse Vincent'" <jesse [...] fsck.com>, "'?var Arnfj?r? Bjarmason'" <avarab [...] gmail.com>, "'karl williamson'" <public [...] khwilliamson.com>, "'Perl5 Porters'" <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 10:22:57 +0200
To: Steffen Mueller <smueller [...] cpan.org>
From: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>
Download (untitled) / with headers
text/plain 2.7k
On Wed, 19 May 2010 09:56:59 +0200, Steffen Mueller <smueller@cpan.org> wrote: Show quoted text
> Hi all, > > Jan Dubois wrote:
> > On Tue, 18 May 2010, Eric Brine wrote:
> >> On Tue, May 18, 2010 at 4:04 PM, demerphq <demerphq@gmail.com> wrote:
> >>> Basically we only have to worry about 'l' because of 'le', and 'f' > >>> because of 'if'. Any others?
> > [...] >
> >> Any of the following immediately following the delimiter are currently > >> valid, but will become a syntax error (e.g. /foo/le+1) or different valid > >> code (e.g. /foo/lt+1): > >> > >> - unless & until from /u > >> - le & lt from /l > >> - [none] from /t > >> > >> We're precluded from using these: > >> > >> - /a (and) > >> - /f (for, foreach) > >> - /n (ne) > >> - /w (when, while)
> > I think these are MUCH more likely to be a problem than the three above.
I have used '/pat/and action' a LOT in one-liners, always being aware that 'and' works and 'or' doesn't. Show quoted text
> >> We don't have to worry about these:
> > [...] >
> > Yes, but why bother? What is wrong with your previous suggestion to only allow > > the new modifiers after a > > > > use 5.014; > > > > That lets us pick the letters based on mnemonic value instead of having to > > work around some obscure edge cases. And the code using the new letters > > will not work on earlier Perl versions anyways, so having the "use 5.014" > > in there is a good idea anyways.
> > I couldn't agree more with this last paragraph. >
> > This will also allow to us to turn any currently unused modifiers into syntax > > errors right away for all 5.14+ code as well without breaking any compatibility. > > > > What's not to like?
> > Edge cases and action at a distance. Edges cases internally (with any > such decision, we start maintaining to branches of behaviour in the same > code base). Action at a distance in user code. > > Of course, use VERSION is lexical, so this isn't action across 1M lines > of code, but it may well be across a couple of thousands in a badly > written application or module. > > Don't get me wrong. I'd rather move forward and make behaviour > conditional on the use VERSION at the top of the file than not move on > at all. But I believe that this is a case of being overzealous regarding > backwards compatibility. > > So I guess overall, this is a +1 to any solution, with use VERSION or > without. > > --Steffen
-- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using 5.00307 through 5.12 and porting perl5.13.x on HP-UX 10.20, 11.00, 11.11, 11.23, and 11.31, OpenSuSE 10.3, 11.0, and 11.1, AIX 5.2 and 5.3. http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
CC: Eric Brine <ikegami [...] adaelis.com>, Jesse Vincent <jesse [...] fsck.com>, "?var Arnfj?r? Bjarmason" <avarab [...] gmail.com>, karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 13:34:34 +0200
To: Jan Dubois <jand [...] activestate.com>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 1.6k
On 19 May 2010 02:03, Jan Dubois <jand@activestate.com> wrote: Show quoted text
> On Tue, 18 May 2010, Eric Brine wrote:
>> On Tue, May 18, 2010 at 4:04 PM, demerphq <demerphq@gmail.com> wrote:
>> > Basically we only have to worry about 'l' because of 'le', and 'f' >> > because of 'if'. Any others? >> >
>> >> Not "if". it's already a syntax error because "i" is a valid option. >> >> Any of the following immediately following the delimiter are currently >> valid, but will become a syntax error (e.g. /foo/le+1) or different valid >> code (e.g. /foo/lt+1): >> >>    - unless & until from /u >>    - le & lt from /l >>    - [none] from /t >> >> We're precluded from using these: >> >>    - /a (and) >>    - /f (for, foreach) >>    - /n (ne) >>    - /w (when, while) >> >> We don't have to worry about these: >> >>    - cmp >>    - eq >>    - if >>    - ge & gt >>    - or >>    - xor >>    - builtin function >>    - sub names >>    - barewords
> > Yes, but why bother?  What is wrong with your previous suggestion to only allow > the new modifiers after a > >    use 5.014; > > That lets us pick the letters based on mnemonic value instead of having to > work around some obscure edge cases.  And the code using the new letters > will not work on earlier Perl versions anyways, so having the "use 5.014" > in there is a good idea anyways. > > This will also allow to us to turn any currently unused modifiers into syntax > errors right away for all 5.14+ code as well without breaking any compatibility. > > What's not to like?
The only nit is, I think, covered by the -E option, which implies a use LATEST; right?

Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Jan Dubois <jand [...] activestate.com>, Eric Brine <ikegami [...] adaelis.com>, Jesse Vincent <jesse [...] fsck.com>, ?var Arnfj?r? Bjarmason <avarab [...] gmail.com>, karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 14:34:37 +0200
To: perl5-porters [...] perl.org, demerphq <demerphq [...] gmail.com>
From: Steffen Mueller <smueller [...] cpan.org>
Download (untitled) / with headers
text/plain 270b
Hi Yves, demerphq wrote: Show quoted text
> Only nit is i think covered by the -E option which implies a use LATEST; right?
It does. (NB: Changing its meaning from exactly the same as "use LATEST" would be severely annoying to document/teach/convey/maintain.) Best regards, Steffen
CC: Jan Dubois <jand [...] activestate.com>, Jesse Vincent <jesse [...] fsck.com>, "?var Arnfj?r? Bjarmason" <avarab [...] gmail.com>, karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 10:32:50 -0400
To: demerphq <demerphq [...] gmail.com>
From: Eric Brine <ikegami [...] adaelis.com>
Download (untitled) / with headers
text/plain 509b
On Wed, May 19, 2010 at 7:34 AM, demerphq <demerphq@gmail.com> wrote: Show quoted text
> Only nit is i think covered by the -E option which implies a use LATEST; > right? >
Hum, neither perlrun nor feature mentions the problem with backwards compatibility of using -E. Maybe something like the following (but I'm not fully awake yet):

Since L<feature> is used to introduce features that are not backwards compatible, using C<< -E'...' >> instead of C<< -e'use 5.014; ...' >> may result in broken code when Perl is upgraded.
CC: Jan Dubois <jand [...] activestate.com>, Jesse Vincent <jesse [...] fsck.com>, "?var Arnfj?r? Bjarmason" <avarab [...] gmail.com>, karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 16:46:37 +0200
To: Eric Brine <ikegami [...] adaelis.com>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 741b
On 19 May 2010 16:32, Eric Brine <ikegami@adaelis.com> wrote: Show quoted text
> On Wed, May 19, 2010 at 7:34 AM, demerphq <demerphq@gmail.com> wrote:
>> >> Only nit is i think covered by the -E option which implies a use LATEST; >> right?
> > hum, neither perlrun nor feature mention the problem with backwards > compatibility of using -E. Maybe something like the following (but I'm not > fully awake yet): > > Since L<feature> is used to introduce features that are not backwards > compatible, using C<< -E'...' >> instead of C<< -e'use 5.014; ...' >> may > result in broken code when Perl is upgraded.
One wonders if we should mention that using one-liners in production is not recommended? Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Eric Brine <ikegami [...] adaelis.com>, Jan Dubois <jand [...] activestate.com>, Jesse Vincent <jesse [...] fsck.com>, "?var Arnfj?r? Bjarmason" <avarab [...] gmail.com>, karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 10:11:04 -0600
To: demerphq <demerphq [...] gmail.com>
From: Tom Christiansen <tchrist [...] perl.com>
Download (untitled) / with headers
text/plain 572b
Show quoted text
>>> Only nit is i think covered by the -E option which implies a use LATEST; >>> right?
Show quoted text
>> hum, neither perlrun nor feature mention the problem with backwards >> compatibility of using -E. Maybe something like the following (but I'm not >> fully awake yet):
Show quoted text
>> Since L<feature> is used to introduce features that are not backwards >> compatible, using C<< -E'...' >> instead of C<< -e'use 5.014; ...' >> may >> result in broken code when Perl is upgraded.
Show quoted text
> One wonders if we should mention that using one liners in production > is not recommended?
It's not?? --tom
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 17:14:47 +0100
To: Perl5 Porters <perl5-porters [...] perl.org>
From: Zefram <zefram [...] fysh.org>
Download (untitled) / with headers
text/plain 431b
Eric Brine wrote: Show quoted text
>Since L<feature> is used to introduce features that are not backwards >compatible, using C<< -E'...' >> instead of C<< -e'use 5.014; ...' >> may >result in broken code when Perl is upgraded.
I thought that was implicit, and very obviously so, in the definition of -E. It's the inherent tradeoff, the price one pays for having such a short shorthand. But I wouldn't object to it being made explicit. -zefram
CC: demerphq <demerphq [...] gmail.com>, Jesse Vincent <jesse [...] fsck.com>, Ævar Arnfjörð Bjarmason <avarab [...] gmail.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 12:20:35 -0600
To: Eric Brine <ikegami [...] adaelis.com>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 2.4k
Eric Brine wrote: Show quoted text
> > On Tue, May 18, 2010 at 4:04 PM, demerphq <demerphq@gmail.com > <mailto:demerphq@gmail.com>> wrote: > > On 18 May 2010 21:30, Jesse Vincent <jesse@fsck.com > <mailto:jesse@fsck.com>> wrote:
> > Neither of those arguments suggest that this is a case when we should > > break backward compatibility gratuitously.
> > We shall have to agree to disagree that this is gratuitous breakage. I > think thats going far too far. >
> > We may well be unable to do this cleanly and sanely. But that's a
> very
> > different sort of argument.
> > Basically we only have to worry about 'l' because of 'le', and 'f' > because of 'if'. Any others? > > > Not "if". it's already a syntax error because "i" is a valid option. > > Any of the following immediately following the delimiter are currently > valid, but will become a syntax error (e.g. /foo/le+1) or different > valid code (e.g. /foo/lt+1): > > * unless & until from /u > * le & lt from /l > * [none] from /t > > We're precluded from using these: > > * /a (and) > * /f (for, foreach) > * /n (ne) > * /w (when, while) >
I don't understand these preclusions. Why, for example, does the existence of 'and' preclude /a, but the existence of 'unless' not preclude /u?

FYI: Yves' original proposal was for an additional /a modifier to restrict the range of \s, \d, et al. to ASCII. That code remains to be written.

How about this alternate solution: instead of creating a syntax error, we deprecate in 5.14 not inserting a space between a pattern terminator and the following word. If one of the new modifiers in conjunction with other legal modifiers matches one of those legal words, we take the old behavior. The documentation will caution people writing new code not to do that, listing all the possibilities. And I only see two such possibilities: 'le' and 'lt'. All the rest listed above require a modifier that doesn't exist. Someone is unlikely to use 'lt' anyway, since the 't' just overrides the 'l'. I don't think it's too much of a burden for someone writing new code to not use 's/foo/bar/le' when the documentation warns against it.

I'm also unsure that 't' is the best modifier. 'traditional' was suggested as an alternative to the preferred 'legacy', since everyone agreed that 'l' should stand for 'locale'. 'h' could stand for heritage or historical, or 'r' for retro, or 'v' for vintage, or even 'vanilla'.
CC: demerphq <demerphq [...] gmail.com>, Jesse Vincent <jesse [...] fsck.com>, Ævar Arnfjörð Bjarmason <avarab [...] gmail.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 14:49:06 -0400
To: karl williamson <public [...] khwilliamson.com>
From: Eric Brine <ikegami [...] adaelis.com>
Download (untitled) / with headers
text/plain 1.1k
On Wed, May 19, 2010 at 2:20 PM, karl williamson <public@khwilliamson.com>wrote: Show quoted text
> I don't understand these preclusions. Why, for example, does the existence > of 'and' preclude /a, but the existence of 'unless' not preclude /u ?
When used as a statement modifier.

$ perl -le'$x+=/foo/unless$c; print "ok"'
ok

If we add /u, the above would die as follows:

Bareword found where operator expected at -e line 1, near "/foo/unless"
        (Missing operator before nless?)
syntax error at -e line 1, near "/foo/unless"
Execution of -e aborted due to compilation errors.

"Preclude" is not quite the right word, at least not on its own. They preclude the addition of the modifier without some form of conflict resolution. Most of the conflicts can even be resolved cleanly by lookahead. (/l isn't resolved cleanly by lookahead.) Show quoted text
> Instead of creating a syntax error, we deprecate in 5.14 not inserting a > space between a pattern terminator and the following word. >
That still breaks backwards compatibility, and we'd have to wait for 5.016 to get /u and /l in. "use 5.014;" avoids both the break and the wait. It could either add /u and /l, or it could add the space requirement.
CC: demerphq <demerphq [...] gmail.com>, Jesse Vincent <jesse [...] fsck.com>, Ævar Arnfjörð Bjarmason <avarab [...] gmail.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 13:46:49 -0600
To: Eric Brine <ikegami [...] adaelis.com>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 2.8k
Eric Brine wrote: Show quoted text
> On Wed, May 19, 2010 at 2:20 PM, karl williamson > <public@khwilliamson.com <mailto:public@khwilliamson.com>> wrote: > > I don't understand these preclusions. Why, for example, does the > existence of 'and' preclude /a, but the existence of 'unless' not > preclude /u ? > > > When used as a statement modifier. > > $ perl -le'$x+=/foo/unless$c; print "ok"' > ok > > If we add /u, the above would die as follows: > > Bareword found where operator expected at -e line 1, near "/foo/unless" > (Missing operator before nless?) > syntax error at -e line 1, near "/foo/unless" > Execution of -e aborted due to compilation errors. > > "Preclude" is not quite the right word, at least not on its own. They > preclude the addition of the modifier without some form of conflict > resolution. Most of the conflicts can even be resolved cleanly by > lookahead. (/l isn't resolved cleanly by lookahead.) > > > Instead of creating a syntax error, we deprecate in 5.14 not > inserting a space between a pattern terminator and the following word. > > > That still breaks backwards compatibility, and we'd have to wait for > 5.016 to get /u and /l in. > > "use 5.014;" avoids both the break and the wait. It could either add /u > and /l, or it could add the space requirement. >
I don't think you understood my suggestion; everything would take effect in 5.14. What I meant is that we could resolve things like the 'unless' case by lookahead. That is, we special-case the l and u (and /a if we get it) modifiers so that they don't take effect if the word they're in is a legal one; you've given the complete list of those (I think).

The algorithm would be: the code would look first for the 5.12 modifier set, as currently. If that exhausts the word, continue as currently. Otherwise, if the word is one of the few you've mentioned, also continue as currently, but raise the deprecation warning. Otherwise, reparse the word, this time allowing the new modifiers. If that exhausts the word, fine, we've got our modifiers. If not, raise the deprecation warning. A syntax error would also be generated if the word isn't recognized.

This would guarantee backward compatibility, with no inappropriate syntax errors. /lt and /le would be resolved by documenting that these have the 5.12 meanings. This would be lifted in 5.16, after the deprecation cycle.

I think the reason to prefer my solution over yours is that it doesn't require a 'use 5.014', which I always forget to include, and I prefer doing deprecation instead of syntax errors. (My first take is that modifying the 'use 5.014' solution to do deprecation would be very similar to taking my suggestion.) Otherwise, I'm fine with yours, except it requires me to learn a new area of Perl in order to make the patch. :)

Does anyone else have an opinion?
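A rough sketch (mine, not the patch) of the fallback parse described above; the letter sets and the list of clashing words are illustrative placeholders, not the real lexer tables:

sub classify_suffix {                       # $word: the letters the lexer found after the closing delimiter
    my ($word) = @_;
    my $old_letters = qr/[msixpogc]/;       # stand-in for the 5.12 modifier set
    my $new_letters = qr/[msixpogclut]/;    # the same set plus the proposed l, u, t
    my %legal_word  = map { $_ => 1 } qw(le lt unless until and);   # illustrative only

    return 'old modifiers' if $word =~ /\A$old_letters*\z/;
    if ($legal_word{$word}) {               # e.g. '/foo/le +1': keep the 5.12 parse
        warn "deprecated: put a space between the pattern and '$word'\n";
        return 'old parse, word kept';
    }
    return 'new modifiers' if $word =~ /\A$new_letters*\z/;
    warn "deprecated: put a space between the pattern and '$word'\n";
    return 'syntax error';
}

print classify_suffix($_), "\n" for qw(g lt lug xyzzy);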
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 21:05:39 +0100
To: Perl5 Porters <perl5-porters [...] perl.org>
From: Zefram <zefram [...] fysh.org>
Download (untitled) / with headers
text/plain 326b
karl williamson wrote: Show quoted text
> What I meant is that we could resolve things like the "unless' >by lookahead. That is we special case the l and u (and /a if we get it) >modifiers so that they don't take effect if the word they're in is a >legal one; the complete list of which you've given (I think).
Eww. -zefram
CC: Eric Brine <ikegami [...] adaelis.com>, Jesse Vincent <jesse [...] fsck.com>, Ævar Arnfjörð Bjarmason <avarab [...] gmail.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 22:07:55 +0200
To: karl williamson <public [...] khwilliamson.com>
From: demerphq <demerphq [...] gmail.com>
On 19 May 2010 21:46, karl williamson <public@khwilliamson.com> wrote: Show quoted text
> Eric Brine wrote:
>> >> On Wed, May 19, 2010 at 2:20 PM, karl williamson <public@khwilliamson.com >> <mailto:public@khwilliamson.com>> wrote: >> >>    I don't understand these preclusions.  Why, for example, does the >>    existence of 'and' preclude /a, but the existence of 'unless' not >>    preclude /u ? >> >> When used as a statement modifier. >> >> $ perl -le'$x+=/foo/unless$c; print "ok"' >> ok >> >> If we add /u, the above would die as follows: >> >> Bareword found where operator expected at -e line 1, near "/foo/unless" >>        (Missing operator before nless?) >> syntax error at -e line 1, near "/foo/unless" >> Execution of -e aborted due to compilation errors. >> >> "Preclude" is not quite the right word, at least not on its own. They >> preclude the addition of the modifier without some form of conflict >> resolution. Most of the conflicts can even be resolved cleanly by lookahead. >> (/l isn't resolved cleanly by lookahead.) >> >>    Instead of creating a syntax error, we deprecate in 5.14 not >>    inserting a space between a pattern terminator and the following word. >> >> >> That still breaks backwards compatibility, and we'd have to wait for 5.016 >> to get /u and /l in. >> >> "use 5.014;" avoids both the break and the wait. It could either add /u >> and /l, or it could add the space requirement. >>
>
> I don't think you understood my suggestion; everything would take effect in
> 5.14.  What I meant is that we could resolve things like the "unless' by
> lookahead.  That is we special case the l and u (and /a if we get it)
> modifiers so that they don't take effect if the word they're in is a legal
> one; the complete list of which you've given (I think).
>
> The algorithm would be: the code would look first for the 5.12 modifier set,
> as currently.  If that exhausts the word, continue as currently. Otherwise
> if the word is one of the few you've mentioned, also continue as currently,
> but raise the deprecated warning.  Otherwise, reparse the word, this time
> allowing the new modifiers.  If that exhausts the word, fine, we've got our
> modifiers.  If not, raise the deprecated warning.  A syntax error would also
> be generated if the word isn't recognized.
>
> This would guarantee backward compatibility, with no inappropriate syntax
> errors.  /lt and /le would be resolved by documenting that these have the
> 5.12 meanings.  This would be lifted in 5.16 after the deprecation cycle.
>
> I think the reasons to prefer my solution over yours is that it doesn't
> require a 'use 5.014'; which I always forget to include, and I prefer doing
> deprecation instead of syntax errors.  (My first take is that modifying the
> 'use 5.014' solution to do deprecation would be very similar to taking my
> suggestion.)
>
> Otherwise, I'm fine with yours, except it requires me to learn a new area of
> Perl in order to make the patch. :)
>
> Does anyone else have an opinion?
I kinda like your plan, with the exception that it's going to be crufty
code until we are past the deprecation cycle.

One thing though, it occurred to me that in this discussion we have
omitted to mention one subtle point.  We don't have to support the new
modifiers as trailing modifiers /at all/ if we don't want to, as we can
always make them restricted to the (?msix:...) form.  This means that you
could have access to the syntax without the 'use 5.014', just not as a
trailing modifier.

One other thing: these new flags are special in that they are essentially
mutually exclusive.  Maybe we SHOULD make them capitalized to emphasize
this fact.  A simple rule like "you may only use one capitalized modifier
at a time" is pretty easy to remember as compared to "the modifiers /l /a
/u and /r are all mutually exclusive", and with capital modifiers we don't
have any back-compat problems.

Also, I think there is precedent in one of the other languages for a /U
modifier; if ours does the same thing, all the better.

Cheers,
yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
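As a reminder of the precedent for the inline-only route, scoped (?...) groups already exist for the current modifiers; the commented line only sketches what the hypothetical new letters might look like there:

use strict;
use warnings;

# Case-insensitivity scoped to part of a pattern, with no trailing modifier.
my $re = qr/foo(?i:BAR)baz/;
print "fooBaRbaz" =~ $re ? "matches\n" : "no match\n";

# Hypothetically, the new mutually-exclusive letters could be confined to
# the same form, e.g. qr/(?u:\w+)/, sidestepping the /foo/unless-style clashes.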
CC: Eric Brine <ikegami [...] adaelis.com>, Jan Dubois <jand [...] activestate.com>, Jesse Vincent <jesse [...] fsck.com>, "?var Arnfj?r? Bjarmason" <avarab [...] gmail.com>, karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 22:09:53 +0200
To: Tom Christiansen <tchrist [...] perl.com>
From: demerphq <demerphq [...] gmail.com>
On 19 May 2010 18:11, Tom Christiansen <tchrist@perl.com> wrote:
>>>> Only nit is i think covered by the -E option which implies a use LATEST;
>>>> right?
>
>>> hum, neither perlrun nor feature mention the problem with backwards
>>> compatibility of using -E. Maybe something like the following (but I'm not
>>> fully awake yet):
>
>>> Since L<feature> is used to introduce features that are not backwards
>>> compatible, using C<< -E'...' >> instead of C<< -e'use 5.014; ...' >> may
>>> result in broken code when Perl is upgraded.
>
>> One wonders if we should mention that using one liners in production
>> is not recommended?
>
> It's not??
Anyone who has written a book on Perl is excepted from this rule.

cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Eric Brine <ikegami [...] adaelis.com>, Jesse Vincent <jesse [...] fsck.com>, Ævar Arnfjörð Bjarmason <avarab [...] gmail.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 14:41:15 -0600
To: demerphq <demerphq [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
demerphq wrote:
> On 19 May 2010 21:46, karl williamson <public@khwilliamson.com> wrote:
>> Eric Brine wrote:
>>> On Wed, May 19, 2010 at 2:20 PM, karl williamson <public@khwilliamson.com >>> <mailto:public@khwilliamson.com>> wrote: >>> >>> I don't understand these preclusions. Why, for example, does the >>> existence of 'and' preclude /a, but the existence of 'unless' not >>> preclude /u ? >>> >>> When used as a statement modifier. >>> >>> $ perl -le'$x+=/foo/unless$c; print "ok"' >>> ok >>> >>> If we add /u, the above would die as follows: >>> >>> Bareword found where operator expected at -e line 1, near "/foo/unless" >>> (Missing operator before nless?) >>> syntax error at -e line 1, near "/foo/unless" >>> Execution of -e aborted due to compilation errors. >>> >>> "Preclude" is not quite the right word, at least not on its own. They >>> preclude the addition of the modifier without some form of conflict >>> resolution. Most of the conflicts can even be resolved cleanly by lookahead. >>> (/l isn't resolved cleanly by lookahead.) >>> >>> Instead of creating a syntax error, we deprecate in 5.14 not >>> inserting a space between a pattern terminator and the following word. >>> >>> >>> That still breaks backwards compatibility, and we'd have to wait for 5.016 >>> to get /u and /l in. >>> >>> "use 5.014;" avoids both the break and the wait. It could either add /u >>> and /l, or it could add the space requirement. >>>
>> I don't think you understood my suggestion; everything would take effect in >> 5.14. What I meant is that we could resolve things like the "unless' by >> lookahead. That is we special case the l and u (and /a if we get it) >> modifiers so that they don't take effect if the word they're in is a legal >> one; the complete list of which you've given (I think). >> >> The algorithm would be: the code would look first for the 5.12 modifier set, >> as currently. If that exhausts the word, continue as currently. Otherwise >> if the word is one of the few you've mentioned, also continue as currently, >> but raise the deprecated warning. Otherwise, reparse the word, this time >> allowing the new modifiers. If that exhausts the word, fine, we've got our >> modifiers. If not, raise the deprecated warning. A syntax error would also >> be generated if the word isn't recognized. >> >> This would guarantee backward compatibility, with no inappropriate syntax >> errors. /lt and /le would be resolved by documenting that these have the >> 5.12 meanings. This would be lifted in 5.16 after the deprecation cycle. >> >> I think the reasons to prefer my solution over yours is that it doesn't >> require a 'use 5.014'; which I always forget to include, and I prefer doing >> deprecation instead of syntax errors. (My first take is that modifying the >> 'use 5.014' solution to do deprecation would be very similar to taking my >> suggestion.) >> >> Otherwise, I'm fine with yours, except it requires me to learn a new area of >> Perl in order to make the patch. :) >> Does anyone else have an opinion?
>
> I kinda like your plan, with the exception that its going to be crufty
> code until we are past the deprecation cycle.
I've actually looked at the code.  More simply stated: aside from doing the
deprecation, do everything exactly as currently, still using the 5.12
modifier set.  But when you get to the part where you would otherwise throw
a syntax error, instead expand the modifier set to include the new ones and
try again.  That's all, not very much code.
>
> One thing tho, it occurred to me that in this discussion we have
> omitted to mention one subtle point.
>
> We don't have to support new modifiers as trailing modifiers /at all/
> if we don't want to, as we can always make them restricted to the
> (?msix:...) form.
>
> This means that you could have access to the syntax without the 'use
> 5.014' just not as a trailing modifier.
But I still find (?msix:) ugly. People I know think perl is "write-only" because they haven't gotten used to all the special characters, which I have grown accustomed to over the years. But still not that construct for me. But as a temporary 5.14 measure, I could accept that.
>
> One other thing.... these new flags are special in that they are
> essential mutually exclusive. Maybe we SHOULD make them capitalized to
> emphasize this fact.
>
> A simple rule like "you may only use one capitalized modifier at a
> time" is a pretty easy to remember as compared to "the modfiers /l /a
> /u and /r are all mutually exclusive" and with capital modifiers we
> dont have any back compat problems.
>
> Also, i think there is precedent in one of the other languages for a
> /U modifier, if ours does the same thing, all the better.
I'm neutral on this.  But I have been thinking lately that maybe the /a modifier wouldn't be mutually exclusive.  I can see someone wanting to restrict \d and \w to ASCII while still wanting /U or /T behavior otherwise; less likely with /L.
>
> Cheers,
> yves
CC: Perl5 Porters <perl5-porters [...] perl.org>, Juerd Waalboer <juerd [...] convolution.nl>, demerphq <demerphq [...] gmail.com>, Glenn Linderman <perl [...] NevCal.com>, Michael G Schwern <schwern [...] pobox.com>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 22:51:50 +0100
To: karl williamson <public [...] khwilliamson.com>
From: Paul LeoNerd Evans <leonerd [...] leonerd.org.uk>
On Tue, May 11, 2010 at 12:54:01PM -0600, karl williamson wrote:
> These commits also add regex modifiers /u (unicode), /l (locale), and /t
> (traditional).  /a is not part of this patch.  I have made up the term
> "Matching mode" to describe this.  I'm open to a better term, if you can
> think of one.
It may perhaps be far too late to reconsider, but I'm not sure I like
these notations. These are three mutually-exclusive settings along one
axis, they are not three independent settings on three different axes,
such as /l vs /g.

Would it not make more sense to group them up under a single /u flag,
something of the following:

  m/Unicode on/u
  m/Unicode off/u0
  m/Unicode if locale says/ul
  m/Unicode traditionally/ut

-- 
Paul "LeoNerd" Evans
leonerd@leonerd.org.uk
ICQ# 4135350 | Registered Linux# 179460
http://www.leonerd.org.uk/
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 16:06:56 -0600
To: "Perl5-Porters" <perl5-porters [...] perl.org>
From: "Curtis Jewell" <perl [...] csjewell.fastmail.us>
On Wed, 19 May 2010 22:51 +0100, "Paul LeoNerd Evans" <leonerd@leonerd.org.uk> wrote:
> On Tue, May 11, 2010 at 12:54:01PM -0600, karl williamson wrote:
> > These commits also add regex modifiers /u (unicode), /l (locale), and /t
> > (traditional). /a is not part of this patch. I have made up the term
> > "Matching mode" to describe this. I'm open to a better term, if you can
> > think of one.
>
> It may perhaps be far too late to reconsider, but I'm not sure I like
> these notations. These are three mutually-exclusive settings along one
> axis, they are not three independent settings on three different axes,
> such as /l vs /g.
>
> Would it not make more sense to group them up under a single /u flag,
> something of the following:
>
> m/Unicode on/u
> m/Unicode off/u0
> m/Unicode if locale says/ul
> m/Unicode traditionally/ut
We do have the assumption that capital letters oppose their lowercase
counterparts, as far as I can tell, so that the first two would be

  m/Unicode on/u
  m/Unicode off/U

(I'm making the assumption we're adding a /U with that /u.)

The question is, are the other two on an axis where we can say "/l applies
only if /u, and /uL would be the equivalent of the proposed /t option?"

(i.e. is locale/traditional a two-state, rather than
locale/something-else/traditional being 3-state?)

--Curtis Jewell

-- 
Curtis Jewell
csjewell@cpan.org              http://csjewell.dreamwidth.org/
perl@csjewell.fastmail.us      http://csjewell.comyr.org/perl/

"Your random numbers are not that random" -- perl-5.10.1.tar.gz/util.c

Strawberry Perl for Windows betas: http://strawberryperl.com/beta/
CC: karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>, Juerd Waalboer <juerd [...] convolution.nl>, demerphq <demerphq [...] gmail.com>, Glenn Linderman <perl [...] nevcal.com>, Michael G Schwern <schwern [...] pobox.com>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 22:11:35 +0000
To: Paul LeoNerd Evans <leonerd [...] leonerd.org.uk>
From: Ævar Arnfjörð Bjarmason <avarab [...] gmail.com>
On Wed, May 19, 2010 at 21:51, Paul LeoNerd Evans <leonerd@leonerd.org.uk> wrote:
> Would it not make more sense to group them up under a single /u flag,
> something of the following:
>
>  m/Unicode on/u
>  m/Unicode off/u0
>  m/Unicode if locale says/ul
>  m/Unicode traditionally/ut
I like these too, but hitherto all the flags were 1-byte and could be
taken in any order, this would change that. That might break some things,
like stuff that uses re.pm and some of the other regex APIs. Not a show
stopper, but something to consider.

Would it only be at the end, or would /foo/ulpg work?

If we're going that route I think we might as well try to make the Perl 6
flag groups work, i.e.:

  m/Some stuff/c:u0:g

That may be a no-go due to conflicting with some other syntax, though.
CC: Tom Christiansen <tchrist [...] perl.com>, Jan Dubois <jand [...] activestate.com>, Jesse Vincent <jesse [...] fsck.com>, "?var Arnfj?r? Bjarmason" <avarab [...] gmail.com>, karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 18:56:56 -0400
To: demerphq <demerphq [...] gmail.com>
From: Eric Brine <ikegami [...] adaelis.com>
On Wed, May 19, 2010 at 4:09 PM, demerphq <demerphq@gmail.com> wrote:
> On 19 May 2010 18:11, Tom Christiansen <tchrist@perl.com> wrote:
> >>>> Only nit is i think covered by the -E option which implies a use
> LATEST;
> >>>> right?
> >
> >>> hum, neither perlrun nor feature mention the problem with backwards > >>> compatibility of using -E. Maybe something like the following (but I'm
> not
> >>> fully awake yet):
> >
> >>> Since L<feature> is used to introduce features that are not backwards > >>> compatible, using C<< -E'...' >> instead of C<< -e'use 5.014; ...' >>
> may
> >>> result in broken code when Perl is upgraded.
> >
> >> One wonders if we should mention that using one liners in production > >> is not recommended?
> > > > It's not??
> > Anyone who has written a book on Perl is excepted from this rule. >
I don't have any books to my cred, and I use one-liners in batch and bash scripts.
CC: Paul LeoNerd Evans <leonerd [...] leonerd.org.uk>, karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>, Juerd Waalboer <juerd [...] convolution.nl>, demerphq <demerphq [...] gmail.com>, Glenn Linderman <perl [...] nevcal.com>, Michael G Schwern <schwern [...] pobox.com>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 18:58:30 -0400
To: Ævar Arnfjörð Bjarmason <avarab [...] gmail.com>
From: Eric Brine <ikegami [...] adaelis.com>
On Wed, May 19, 2010 at 6:11 PM, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> If we're going that route I think we might as well try to make the
> Perl 6 flag groups work, i.e.:
>
> m/Some stuff/c:u0:g
>
> That may be a no-go due to conflicting with some other syntax, though.
>
f?s/.../.../...:g is currently valid.
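To spell that out (a sketch with made-up subs f and g, not code from the thread):

use strict;
use warnings;
use constant f => 1;     # a 0-ary "f", so "f ?" reads as the ternary operator
sub g { "fallback" }     # a sub named g, so the trailing ":g" is a call to g()

$_ = "abc";
my $result = f ? s/b/X/g : g;    # parses as: f() ? (s/b/X/g) : g()
print "$_ => $result\n";         # prints "aXc => 1"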
CC: Perl5-Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 19 May 2010 23:08:37 -0600
To: Curtis Jewell <perl [...] csjewell.fastmail.us>
From: karl williamson <public [...] khwilliamson.com>
Curtis Jewell wrote:
> On Wed, 19 May 2010 22:51 +0100, "Paul LeoNerd Evans" > <leonerd@leonerd.org.uk> wrote:
>> On Tue, May 11, 2010 at 12:54:01PM -0600, karl williamson wrote:
>>> These commits also add regex modifiers /u (unicode), /l (locale), and /t >>> (traditional). /a is not part of this patch. I have made up the term >>> "Matching mode" to describe this. I'm open to a better term, if you can >>> think of one.
>> It may perhaps be far too late to reconsider, but I'm not sure I like >> these notations. These are three mutually-exclusive settings along one >> axis, they are not three independent settings on three different axes, >> such as /l vs /g. >> >> Would it not make more sense to group them up under a single /u flag, >> something of the following: >> >> m/Unicode on/u >> m/Unicode off/u0 >> m/Unicode if locale says/ul >> m/Unicode traditionally/ut
> > We do have the assumption that capital letters oppose their lowercase > counterparts, as far as I can tell, so that the first two would be > > m/Unicode on/u > m/Unicode off/U > > (I'm making the assumption we're adding a /U with that /u.) > > The question is, are the other two on an axis where we can say "/l > applies only if /u, and /uL would be the equivalent of the proposed /t > option?" > > (i.e. is locale/traditional a two state, rather than locale/something > else/traditional being 3-state?)
It is tri-state, with each value excluding the other two, and maybe a fourth value will be added to make it quad-state.
CC: Tom Christiansen <tchrist [...] perl.com>, Jan Dubois <jand [...] activestate.com>, Jesse Vincent <jesse [...] fsck.com>, "?var Arnfj?r? Bjarmason" <avarab [...] gmail.com>, karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Thu, 20 May 2010 08:42:30 +0200
To: Eric Brine <ikegami [...] adaelis.com>
From: demerphq <demerphq [...] gmail.com>
On 20 May 2010 00:56, Eric Brine <ikegami@adaelis.com> wrote:
> On Wed, May 19, 2010 at 4:09 PM, demerphq <demerphq@gmail.com> wrote:
>> >> On 19 May 2010 18:11, Tom Christiansen <tchrist@perl.com> wrote:
>> >>>> Only nit is i think covered by the -E option which implies a use >> >>>> LATEST; >> >>>> right?
>> >
>> >>> hum, neither perlrun nor feature mention the problem with backwards >> >>> compatibility of using -E. Maybe something like the following (but I'm >> >>> not >> >>> fully awake yet):
>> >
>> >>> Since L<feature> is used to introduce features that are not backwards >> >>> compatible, using C<< -E'...' >> instead of C<< -e'use 5.014; ...' >> >> >>> may >> >>> result in broken code when Perl is upgraded.
>> >
>> >> One wonders if we should mention that using one liners in production >> >> is not recommended?
>> > >> > It's not??
>> >> Anyone who has written a book on Perl is excepted from this rule.
> > I don't have any books to my cred, and I use one-liners in batch and bash > scripts.
T'would appear you have two options: replace them with scripts, or get
published like Tom. ;-)

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicodesemantics for \s, \w
Date: Thu, 20 May 2010 09:30:08 +0200
To: Zefram <zefram [...] fysh.org>
From: Steffen Mueller <smueller [...] cpan.org>
Hi all,

Zefram wrote:
> Eric Brine wrote:
>> Since L<feature> is used to introduce features that are not backwards
>> compatible, using C<< -E'...' >> instead of C<< -e'use 5.014; ...' >> may
>> result in broken code when Perl is upgraded.
>
> I thought that was implicit, and very obviously so, in the definition
> of -E.  It's the inherent tradeoff, the price one pays for having such
> a short shorthand.  But I wouldn't object to it being made explicit.
I agree with Zefram. This backwards-incompatible behavior is the *whole
point* of -E.

FWIW, I think documenting the current equivalent of -E (i.e. -e 'use
5.0XX;') is at best adding noise to the documentation. The perl core may
have many problems, but certainly none of those is *lack* of docs.

Cheers,
Steffen
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicodesemantics for \s, \w
Date: Thu, 20 May 2010 12:57:09 +0100
To: smueller [...] cpan.org, perl5-porters [...] perl.org
From: Ben Morrow <ben [...] morrow.me.uk>
Quoth smueller@cpan.org (Steffen Mueller):
> Hi all, > > Zefram wrote:
> > Eric Brine wrote:
> >> Since L<feature> is used to introduce features that are not backwards > >> compatible, using C<< -E'...' >> instead of C<< -e'use 5.014; ...' >> may > >> result in broken code when Perl is upgraded.
> > > > I thought that was implicit, and very obviously so, in the definition > > of -E. It's the inherent tradeoff, the price one pays for having such > > a short shorthand. But I wouldn't object to it being made explicit.
> > I agree with Zefram. This backwards-incompatible behavior is the *whole > point* of -E. > > FWIW, I think documenting the current equivalent of -E (i.e. -e 'use > 5.0XX;') is at best adding noise to the documentation. The perl core may > have many problems but certainly, none of those is *lack* of docs.
Is it worth documenting the shorter version of that, that is

    perl -M5.10.0 -e'...'

? That seems like a good compromise, to me, for people who aren't
actually typing this into a shell. I realise it's an obvious consequence
of other features, but it might be worth trying to push people in the
direction of using it instead of -E. Otherwise, in five years' time,
we'll get people saying 'but you can't add anything to -E! I've got
four-and-a-half million one-liners that will *break*!'

Ben
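For what it's worth, a minimal illustration of that form against a 5.10-or-newer perl; the output shown is what I would expect, not a transcript from this thread:

    % perl -M5.010 -e 'say "say() is enabled here"'
    say() is enabled here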
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Thu, 20 May 2010 13:03:36 +0100
To: public [...] khwilliamson.com, perl5-porters [...] perl.org
From: Ben Morrow <ben [...] morrow.me.uk>
Quoth public@khwilliamson.com (karl williamson):
> Curtis Jewell wrote:
> > On Wed, 19 May 2010 22:51 +0100, "Paul LeoNerd Evans" > > <leonerd@leonerd.org.uk> wrote:
> >> On Tue, May 11, 2010 at 12:54:01PM -0600, karl williamson wrote:
> >>> These commits also add regex modifiers /u (unicode), /l (locale), and /t > >>> (traditional). /a is not part of this patch. I have made up the term > >>> "Matching mode" to describe this. I'm open to a better term, if you can > >>> think of one.
> >> It may perhaps be far too late to reconsider, but I'm not sure I like > >> these notations. These are three mutually-exclusive settings along one > >> axis, they are not three independent settings on three different axes, > >> such as /l vs /g. > >> > >> Would it not make more sense to group them up under a single /u flag, > >> something of the following: > >> > >> m/Unicode on/u > >> m/Unicode off/u0 > >> m/Unicode if locale says/ul > >> m/Unicode traditionally/ut
> > > > We do have the assumption that capital letters oppose their lowercase > > counterparts, as far as I can tell, so that the first two would be > > > > m/Unicode on/u > > m/Unicode off/U > > > > (I'm making the assumption we're adding a /U with that /u.) > > > > The question is, are the other two on an axis where we can say "/l > > applies only if /u, and /uL would be the equivalent of the proposed /t > > option?" > > > > (i.e. is locale/traditional a two state, rather than locale/something > > else/traditional being 3-state?)
> > It is tri-state, with each value excluding the other two, and maybe a > fourth value will be added to make it quad-state.
Since there aren't any upper-case modifiers yet, it would be possible to
introduce the rule 'Upper-case modifiers take a single-character
argument' at this point. This would give a syntax like

    m/unicode    /Uugx;
    m/locale     /Ulgx;
    m/traditional/Utgx;

which seems at least as clear as three or four random letters that
happen to be mutually-exclusive.

While I like the \s\S symmetry of the regex escapes, that doesn't apply
here (yet), so we don't need to keep it if it's inconvenient (which it
is, since here we are with a three-state option).

Ben
CC: smueller [...] cpan.org, perl5-porters [...] perl.org
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicodesemantics for \s, \w
Date: Thu, 20 May 2010 14:32:04 +0200
To: Ben Morrow <ben [...] morrow.me.uk>
From: Frank Wiegand <frank.wiegand [...] gmail.com>
On Thursday, 20.05.2010, at 12:57 +0100, Ben Morrow wrote:
> Quoth smueller@cpan.org (Steffen Mueller):
> > FWIW, I think documenting the current equivalent of -E (i.e. -e 'use > > 5.0XX;') is at best adding noise to the documentation. The perl core may > > have many problems but certainly, none of those is *lack* of docs.
> > Is it worth documenting the shorter version of that, that is > > perl -M5.10.0 -e'...'
    % perl -E '$notset'

    % perl -e 'use 5.012; $notset'
    Global symbol "$notset" requires explicit package name at -e line 1.
    Execution of -e aborted due to compilation errors.

C<use 5.012> enables feature.pm *and* strictures (without loading strict.pm).

-E (on 5.012) is the same as

    % perl -e 'use feature ":5.12"'

But how to document this generically?

    % perl -e 'use feature ":$^V"'
    Feature bundle "v5.12.1" is not supported by Perl 5.12.1 at -e line 1
    BEGIN failed--compilation aborted at -e line 1.

    % perl -e 'use feature ":$]"'
    Feature bundle "5.012001" is not supported by Perl 5.12.1 at -e line 1
    BEGIN failed--compilation aborted at -e line 1.

Frank
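One way to express it generically, offered as a sketch rather than proposed documentation text, is to build the major-version bundle name at compile time; this assumes a 5.10-or-newer perl, where per-major-release bundles exist:

use strict;
use warnings;

BEGIN {
    # 5.012001 -> ":5.12"; bundles exist per major release, not per point release.
    my ($maj, $min) = $] =~ /\A(\d+)\.(\d{3})/;
    require feature;
    feature->import( sprintf ":%d.%d", $maj, $min );
}

say "feature bundle for perl $] is enabled";   # 'say' now compiles in this scope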
CC: Ben Morrow <ben [...] morrow.me.uk>, smueller [...] cpan.org, perl5-porters [...] perl.org
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicodesemantics for \s, \w
Date: Thu, 20 May 2010 12:45:13 +0000
To: Frank Wiegand <frank.wiegand [...] gmail.com>
From: Ævar Arnfjörð Bjarmason <avarab [...] gmail.com>
On Thu, May 20, 2010 at 12:32, Frank Wiegand <frank.wiegand@gmail.com> wrote:
> Am Donnerstag, den 20.05.2010, 12:57 +0100 schrieb Ben Morrow:
>> Quoth smueller@cpan.org (Steffen Mueller):
>> > FWIW, I think documenting the current equivalent of -E (i.e. -e 'use >> > 5.0XX;') is at best adding noise to the documentation. The perl core may >> > have many problems but certainly, none of those is *lack* of docs.
>> >> Is it worth documenting the shorter version of that, that is >> >>     perl -M5.10.0 -e'...'
> >        % perl -E '$notset' > >        % perl -e 'use 5.012; $notset' >        Global symbol "$notset" requires explicit package name at -e line 1. >        Execution of -e aborted due to compilation errors. > > C<use 5.012> enables feature.pm *and* strictures (without loading strict.pm). > > -E (on 5.012) is the same as > >        % perl -e 'use feature ":5.12"' > > But how to document this generically? > >        % perl -e 'use feature ":$^V"' >        Feature bundle "v5.12.1" is not supported by Perl 5.12.1 at -e line 1 >        BEGIN failed--compilation aborted at -e line 1. > >        % perl -e 'use feature ":$]"' >        Feature bundle "5.012001" is not supported by Perl 5.12.1 at -e line 1 >        BEGIN failed--compilation aborted at -e line 1. > > > Frank >
By saying that we have bundles for major versions starting with 5.10.

This also needs to be brought up to date:

"""
It's possible to load a whole slew of features in one go, using
a I<feature bundle>.  The name of a feature bundle is prefixed with
a colon, to distinguish it from an actual feature. At present, the
only feature bundle is C<use feature ":5.10"> which is equivalent
to C<use feature qw(switch say state)>.
"""
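For instance, the updated text could show the newest bundle; to the best of my recollection the 5.12 bundle is the following, though the exact list should be checked against feature.pm:

    use feature ':5.12';
    # which, if memory serves, is equivalent to:
    use feature qw(say state switch unicode_strings);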
CC: public [...] khwilliamson.com, perl5-porters [...] perl.org
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Thu, 20 May 2010 16:14:16 +0100
To: Ben Morrow <ben [...] morrow.me.uk>
From: Paul LeoNerd Evans <leonerd [...] leonerd.org.uk>
On Thu, May 20, 2010 at 01:03:36PM +0100, Ben Morrow wrote:
> Since there aren't any upper-case modifiers yet, it would be possible to
> introduce the rule 'Upper-case modifiers take a single-character
> argument' at this point. This would give a syntax like
>
> m/unicode /Uugx;
> m/locale /Ulgx;
> m/traditional/Utgx;
>
> which seems at least as clear as three or four random letters that
> happen to be mutually-exclusive.
Oh, now I do like that. It has all the neatness of my suggestion, but
with the added bonus that now we know that capitals take a sub-flag,
so we should look at the next letter afterwards.

Perhaps we could establish this as a general precedent? I have no idea
what this means

  m/foo/AzTlxMvg

but at least I can parse it as A=z, T=l, x=true, M=v, g=true

-- 
Paul "LeoNerd" Evans
leonerd@leonerd.org.uk
ICQ# 4135350 | Registered Linux# 179460
http://www.leonerd.org.uk/
CC: Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Mon, 24 May 2010 07:15:12 -0600
To: Paul LeoNerd Evans <leonerd [...] leonerd.org.uk>
From: karl williamson <public [...] khwilliamson.com>
Paul LeoNerd Evans wrote:
> On Thu, May 20, 2010 at 01:03:36PM +0100, Ben Morrow wrote:
>> Since there aren't any upper-case modifiers yet, it would be possible to >> introduce the rule 'Upper-case modifiers take a single-character >> argument' at this point. This would give a syntax like >> >> m/unicode /Uugx; >> m/locale /Ulgx; >> m/traditional/Utgx; >> >> which seems at least as clear as three or four random letters that >> happen to be mutually-exclusive.
> > Oh, now I do like that. It has all the neatness of my suggestion, but > with the added bonus that now we know that capitals take a sub-flag, > which we should look at the next letter afterwards. > > Perhaps we could establish this as a general precident? I have no idea > what this means > > m/foo/AzTlxMvg > > but at least I can parse it as A=z, T=l, x=true, M=v, g=true >
Somehow, I'm a little leery of this, but I can't put my finger on it.
I'll wait a little longer to see what others may say.

One thing, though, if we do go this route: I think that U isn't as good
a choice as maybe M for "matching mode", as Ul really isn't about
Unicode: it is about locale.
CC: Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 25 May 2010 15:37:30 +0100
To: karl williamson <public [...] khwilliamson.com>
From: Paul LeoNerd Evans <leonerd [...] leonerd.org.uk>
On Mon, May 24, 2010 at 07:15:12AM -0600, karl williamson wrote:
> >> m/unicode    /Uugx;
> >> m/locale     /Ulgx;
> >> m/traditional/Utgx;
> > Somehow, I'm a little leery of this, but I can't put my finger on > it. I'll wait a little longer to see what others may say. > > One thing though if we do go this route, I think that U isn't as > good a choice as maybe M for "matching mode", as Ul really isn't > about Unicode: it is about locale.
Another thought on 'U' comes to mind - no keyword starts with a capital
U. That means we can't possibly break existing syntax doing

  m/pattern/KEYWORD

-- 
Paul "LeoNerd" Evans
leonerd@leonerd.org.uk
ICQ# 4135350 | Registered Linux# 179460
http://www.leonerd.org.uk/
CC: karl williamson <public [...] khwilliamson.com>, Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Wed, 26 May 2010 02:24:49 +0100
To: Paul LeoNerd Evans <leonerd [...] leonerd.org.uk>
From: James Mastros <james [...] mastros.biz>
On 25 May 2010 15:37, Paul LeoNerd Evans <leonerd@leonerd.org.uk> wrote:
> On Mon, May 24, 2010 at 07:15:12AM -0600, karl williamson wrote:
>> >>    m/unicode    /Uugx; >> >>    m/locale     /Ulgx; >> >>    m/traditional/Utgx;
>> >> Somehow, I'm a little leery of this, but I can't put my finger on >> it. I'll wait a little longer to see what others may say. >> >> One thing though if we do go this route, I think that U isn't as >> good a choice as maybe M for "matching mode", as Ul really isn't >> about Unicode: it is about locale.
That suggests that we should possibly be calling it /L: /Lt
(traditional), /Lu (unicode), /Ll (locale). Sadly, /Ll isn't overly
clear, given the two different definitions of the word "locale" in the
same statement. /Lp (posix) would be clearer, perhaps, but is overly
technical. Another possibility is calling it /I, for
internationalization, but that leads to the sad /Il -- where one of
those is an i-for-India, and the other an l-for-lima.

-=- James Mastros, theorbtwo
CC: Paul LeoNerd Evans <leonerd [...] leonerd.org.uk>, Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 25 May 2010 19:35:43 -0600
To: James Mastros <james [...] mastros.biz>
From: karl williamson <public [...] khwilliamson.com>
James Mastros wrote:
> On 25 May 2010 15:37, Paul LeoNerd Evans <leonerd@leonerd.org.uk> wrote:
>> On Mon, May 24, 2010 at 07:15:12AM -0600, karl williamson wrote:
>>>>> m/unicode /Uugx; >>>>> m/locale /Ulgx; >>>>> m/traditional/Utgx;
>>> Somehow, I'm a little leery of this, but I can't put my finger on >>> it. I'll wait a little longer to see what others may say. >>> >>> One thing though if we do go this route, I think that U isn't as >>> good a choice as maybe M for "matching mode", as Ul really isn't >>> about Unicode: it is about locale.
> > That suggests that we should possibly be calling it /L: /Lt > (traditional), /Lu (unicode), /Ll (locale). Sadly, /Ll isn't overly > clear, given the two different definitions of the word "locale" in the > same statement. /Lp (posix) would be clearer, perhaps, but is overly > technical. Another possibility is calling it /I, for > internationalization, but that reads to the sad /Il -- where one of > those is an i-for-India , and the other an l-for-lima. > > -=- James Mastros, theorbtwo >
I don't get what L would stand for. Please elaborate.
Subject: Please consider applying several commits in PATCH: [perl #58182]
Date: Wed, 26 May 2010 18:23:50 -0600
To: Perl5 Porters <perl5-porters [...] perl.org>
From: karl williamson <public [...] khwilliamson.com>
The overall patch I submitted had a number of commits, most of which
shouldn't be controversial.  These were to fix typos and minor errors I
discovered along the way.

So I'm attaching those patches again; please consider applying any or all
that you're comfortable with.

All but the final one of this set are trivial patches, and I think should
be added regardless of whatever else happens.

The patch file here numbered 6 is not trivial; it adds the semantics of
unicode strings, but not the regex modifiers; with it \s and \w work with
Unicode semantics when under use feature unicode_strings.  But any
compiled regex will lose memory of that when compiled into another regex.
For that functionality we need to have modifiers specified, and what those
will be isn't clear yet.  So I don't know if this one should be applied in
isolation.  It would allow for smoking of the underlying functionality
while we decide the UI for the modifiers.  It needs patch numbered 2 in
order to not exceed array bounds.
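To illustrate the behaviour patch 6 aims at, here is a sketch revisiting the no-break-space case from the original report; it assumes a perl with that patch applied (or a later release where unicode_strings reaches the regex character classes):

use strict;
use warnings;
use feature 'unicode_strings';

# U+00A0 (NO-BREAK SPACE) in a plain byte string: with Unicode semantics,
# \s should match it even though the string is not stored as UTF-8.
print "\xa0" =~ /\A\s/ ? "matches\n" : "no match\n";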
From a297b4fe606e09a5e218a7af32fb23517ccf8ca6 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@khw-desktop.(none)>
Date: Thu, 14 Jan 2010 16:01:13 -0700
Subject: [PATCH] Typo

---
 universal.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/universal.c b/universal.c
index 5a2cddb..e4fe9be 100644
--- a/universal.c
+++ b/universal.c
@@ -1289,7 +1289,7 @@ XS(XS_re_regexp_pattern)
 
     if ((re = SvRX(ST(0)))) /* assign deliberate */
     {
-        /* Housten, we have a regex! */
+        /* Houston, we have a regex! */
         SV *pattern;
         STRLEN left = 0;
         char reflags[6];
-- 
1.5.6.3
From 0fc54becc48ebfe38c45d0b4eb2a41529074ec41 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@khw-desktop.(none)>
Date: Thu, 14 Jan 2010 16:02:14 -0700
Subject: [PATCH] Use sizeof instead of hard-coded array size

The array should be declared with its actual size.
---
 universal.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/universal.c b/universal.c
index e4fe9be..97d2f18 100644
--- a/universal.c
+++ b/universal.c
@@ -1292,7 +1292,7 @@ XS(XS_re_regexp_pattern)
         /* Houston, we have a regex! */
         SV *pattern;
         STRLEN left = 0;
-        char reflags[6];
+        char reflags[sizeof(INT_PAT_MODS)];
 
         if ( GIMME_V == G_ARRAY ) {
             /*
-- 
1.5.6.3
From 9991fad598f21e55e22d23af7a396290a5c5d0cd Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@khw-desktop.(none)>
Date: Thu, 14 Jan 2010 17:36:46 -0700
Subject: [PATCH] Add tested for corrupted regnode

---
 regcomp.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/regcomp.c b/regcomp.c
index 337f0c4..84c7dc1 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -9841,6 +9841,10 @@ Perl_regnext(pTHX_ register regnode *p)
     if (!p)
         return(NULL);
 
+    if (OP(p) > REGNODE_MAX) {  /* regnode.type is unsigned */
+        Perl_croak(aTHX_ "Corrupted regexp opcode %d > %d", (int)OP(p), (int)REGNODE_MAX);
+    }
+
     offset = (reg_off_by_arg[OP(p)] ? ARG(p) : NEXT_OFF(p));
     if (offset == 0)
         return(NULL);
-- 
1.5.6.3
From a04b6010ab24747974feff2214d22f2c7a08d23a Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@khw-desktop.(none)>
Date: Thu, 14 Jan 2010 19:19:22 -0700
Subject: [PATCH] Display characters as Unicode for clarity

---
 t/re/pat_special_cc.t |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/t/re/pat_special_cc.t b/t/re/pat_special_cc.t
index 1138cbb..36116b8 100644
--- a/t/re/pat_special_cc.t
+++ b/t/re/pat_special_cc.t
@@ -37,6 +37,7 @@ sub run_tests {
     my @plain_complement_failed;
     for my $ord (0 .. $upper_bound) {
         my $ch= chr $ord;
+        my $ord = sprintf "U+%04X", $ord; # For display in Unicode terms
         my $plain= $ch=~/$special/ ? 1 : 0;
         my $plain_u= $ch=~/$upper/ ? 1 : 0;
         push @plain_complement_failed, "$ord-$plain-$plain_u" if $plain == $plain_u;
-- 
1.5.6.3
From 8d24edd55e96df3ce466ce4888910038db233b4d Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@khw-desktop.(none)>
Date: Thu, 14 Jan 2010 21:21:37 -0700
Subject: [PATCH] Clarify that count is bytes not unicode characters

---
 regexec.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/regexec.c b/regexec.c
index 17a0dc6..0f67a65 100644
--- a/regexec.c
+++ b/regexec.c
@@ -2194,7 +2194,7 @@ Perl_regexec_flags(pTHX_ REGEXP * const rx, char *stringarg, register char *stre
         RE_PV_QUOTED_DECL(quoted,do_utf8,PERL_DEBUG_PAD_ZERO(1),
             s,strend-s,60);
         PerlIO_printf(Perl_debug_log,
-            "Matching stclass %.*s against %s (%d chars)\n",
+            "Matching stclass %.*s against %s (%d bytes)\n",
             (int)SvCUR(prop), SvPVX_const(prop),
             quoted, (int)(strend - s));
     }
-- 
1.5.6.3

Message body is not shown because it is too large.

CC: Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: Please consider applying several commits in PATCH: [perl #58182]
Date: Fri, 28 May 2010 23:28:15 +0200
To: karl williamson <public [...] khwilliamson.com>
From: demerphq <demerphq [...] gmail.com>
On 27 May 2010 02:23, karl williamson <public@khwilliamson.com> wrote:
> The overall patch I submitted had a number of commits, most of which > shouldn't be controversial.  These were to fix typos and minor errors I > discovered along the way. > > So I'm attaching those patches again; please consider applying any or all > that you're comfortable with. > > All but the final one of this set are trivial patches, and I think should be > added regardless of whatever else happens. > > The patch file here numbered 6 is not trivial; it adds the semantics of > unicode strings, but not the regex modifiers; with it \s and \w work with > Unicode semantics when under use feature unicode_strings.  But any compiled > regex will lose memory of that when compiled into another regex.  For that > functionality we need to have modifiers specified, and what those will be > isn't clear yet.  So I don't know if this one should be applied in > isolation.  It would allow for smoking of the underlying functionality while > we decide the UI for the modifiers.  It needs patch numbered 2 in order to > not exceed array bounds. >
Patch 1 fixes a comment I wrote. Should be applied.

Patch 2 fixes code I originally wrote. Should be applied. I wonder if
there isn't similar code elsewhere...

Patch 3 is related to some code I added to detect buffer overruns in
the compiled patterns, and IMO should be applied.

Patch 4 improves test code I wrote and should be added.

Patch 5 is (woohoo) nothing to do with me, but looks good.

Patch 6 I think should be dealt with independently.

Is there anywhere I can pull these from? I still haven't set up a sane
way to download patches from gmail. (Email clients are by definition
evil.)

cheers,
yves

ps: did I already ask for a clone of Karl yet?

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
CC: karl williamson <public [...] khwilliamson.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: Please consider applying several commits in PATCH: [perl #58182]
Date: Sat, 29 May 2010 10:44:44 -0500
To: demerphq <demerphq [...] gmail.com>
From: "Craig A. Berry" <craig.a.berry [...] gmail.com>
On Fri, May 28, 2010 at 4:28 PM, demerphq <demerphq@gmail.com> wrote:
> On 27 May 2010 02:23, karl williamson <public@khwilliamson.com> wrote:
>> The overall patch I submitted had a number of commits, most of which >> shouldn't be controversial.  These were to fix typos and minor errors I >> discovered along the way. >> >> So I'm attaching those patches again; please consider applying any or all >> that you're comfortable with. >> >> All but the final one of this set are trivial patches, and I think should be >> added regardless of whatever else happens. >> >> The patch file here numbered 6 is not trivial; it adds the semantics of >> unicode strings, but not the regex modifiers; with it \s and \w work with >> Unicode semantics when under use feature unicode_strings.  But any compiled >> regex will lose memory of that when compiled into another regex.  For that >> functionality we need to have modifiers specified, and what those will be >> isn't clear yet.  So I don't know if this one should be applied in >> isolation.  It would allow for smoking of the underlying functionality while >> we decide the UI for the modifiers.  It needs patch numbered 2 in order to >> not exceed array bounds. >>
> > Patch 1 fixes a comment I wrote. Should be applied. > > Patch 2 fixes code I originally wrote. Should be applied. I wonder if > there isnt similar code elsewhere... > > Patch 3  is related to some code I added to detect buffer overruns in > the compiled patters, and IMO should be applied > > Patch 4 improves test code I wrote and should be added. > > Patch 5 is (woohoo) nothing to do with me, but looks good. > > Patch 6 I think should be dealt with independently.
1-5 are in.  6 does not apply cleanly and it sounds like Yves wants more
discussion anyway.

% git am ../unapp/00*.patch
Applying: Typo
Applying: Use sizeof instead of hard-coded array size
Applying: Add tested for corrupted regnode
Applying: Display characters as Unicode for clarity
Applying: Clarify that count is bytes not unicode characters
Applying: Add Unicode semantics to regex case-sensitive matching
/Users/craig/perlrep/perl/.git/rebase-apply/patch:76: trailing whitespace.
        my @w = (0) x 256;
/Users/craig/perlrep/perl/.git/rebase-apply/patch:130: trailing whitespace.
/Users/craig/perlrep/perl/.git/rebase-apply/patch:167: trailing whitespace.
/Users/craig/perlrep/perl/.git/rebase-apply/patch:315: trailing whitespace.
        ANYOF_BITMAP_SET(data->start_class, value);
/Users/craig/perlrep/perl/.git/rebase-apply/patch:389: trailing whitespace.
        ANYOF_BITMAP_SET(data->start_class, value);
error: patch failed: handy.h:460
error: handy.h: patch does not apply
error: patch failed: op.h:357
error: op.h: patch does not apply
error: patch failed: regexp.h:272
error: regexp.h: patch does not apply
Patch failed at 0006 Add Unicode semantics to regex case-sensitive matching
When you have resolved this problem run "git am --resolved".
If you would prefer to skip this patch, instead run "git am --skip".
To restore the original branch and stop patching run "git am --abort".
> Is there anywhere I can pull these from? I still havent set up a sane
> way to download patches from gmail. (email clients are by definition
> evil).
One way is to click the drop-down next to the Reply button and select
Show Original.  Then do a Save As on that page in your browser to create
a single text file containing the whole message.  That only works if the
attachments have text content-type, which these do.  Unfortunately, git
am could not figure out what type of patch was in the file.  GNU patch
could process the file without difficulty, but then you lose the commit
messages and attributions, so I didn't use this method in this case.

Gmail has a "Download all attachments" link for messages with multiple
attachments.  In principle you get a zip file which you could unpack in
a directory of your choosing and run git am on that directory.
Unfortunately Gmail fails with "Bad request" when I click on that link,
so I didn't use that method either.

I ended up just clicking on each attachment individually and running git
am on the lot of them.  To me that's still a lot easier than setting up
a remote and pulling from it.
> ps: did I already ask for a clone of Karl yet?
As far as I know, he's not a sheep and doesn't live in England, so that's still a TODO :-).
CC: Steffen Mueller <smueller [...] cpan.org>, Jan Dubois <jand [...] activestate.com>, 'Eric Brine' <ikegami [...] adaelis.com>, 'demerphq' <demerphq [...] gmail.com>, 'Jesse Vincent' <jesse [...] fsck.com>, '?var Arnfj?r? Bjarmason' <avarab [...] gmail.com>, 'karl williamson' <public [...] khwilliamson.com>, 'Perl5 Porters' <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 8 Jun 2010 17:14:41 +0200
To: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>
From: Abigail <abigail [...] abigail.be>
On Wed, May 19, 2010 at 10:22:57AM +0200, H.Merijn Brand wrote:
> On Wed, 19 May 2010 09:56:59 +0200, Steffen Mueller <smueller@cpan.org> > wrote: >
> > Hi all, > > > > Jan Dubois wrote:
> > > On Tue, 18 May 2010, Eric Brine wrote:
> > >> On Tue, May 18, 2010 at 4:04 PM, demerphq <demerphq@gmail.com> wrote:
> > >>> Basically we only have to worry about 'l' because of 'le', and 'f' > > >>> because of 'if'. Any others?
> > > > [...] > >
> > >> Any of the following immediately following the delimiter are currently
> > >> valid, but will become a syntax error (e.g. /foo/le+1) or different valid
> > >> code (e.g. /foo/lt+1):
> > >>
> > >> - unless & until from /u
> > >> - le & lt from /l
> > >> - [none] from /t
> > >>
> > >> We're precluded from using these:
> > >>
> > >> - /a (and)
> > >> - /f (for, foreach)
> > >> - /n (ne)
> > >> - /w (when, while)
> > > > I think these are MUCH more likely to be a problem than the three above.
> > I have used '/pat/and action' a LOT in one-liners, alwyas being aware > that 'and' works, and 'or' doesn't
But one-liners are just one-liners.  Write once, run once.  Maybe twice,
using the history function of one's shell.

I'm all for backwards compatibility, but "it's going to break my
one-liners" isn't much of an argument, IMO.

Abigail; '/pat/&&action' saves two characters.
CC: Eric Brine <ikegami [...] adaelis.com>, demerphq <demerphq [...] gmail.com>, Jesse Vincent <jesse [...] fsck.com>, Ævar Arnfjörð Bjarmason <avarab [...] gmail.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 8 Jun 2010 17:32:42 +0200
To: karl williamson <public [...] khwilliamson.com>
From: Abigail <abigail [...] abigail.be>
On Wed, May 19, 2010 at 01:46:49PM -0600, karl williamson wrote:
> Eric Brine wrote:
>> On Wed, May 19, 2010 at 2:20 PM, karl williamson >> <public@khwilliamson.com <mailto:public@khwilliamson.com>> wrote: >> >> I don't understand these preclusions. Why, for example, does the >> existence of 'and' preclude /a, but the existence of 'unless' not >> preclude /u ? >> >> >> When used as a statement modifier. >> >> $ perl -le'$x+=/foo/unless$c; print "ok"' >> ok >> >> If we add /u, the above would die as follows: >> >> Bareword found where operator expected at -e line 1, near "/foo/unless" >> (Missing operator before nless?) >> syntax error at -e line 1, near "/foo/unless" >> Execution of -e aborted due to compilation errors. >> >> "Preclude" is not quite the right word, at least not on its own. They >> preclude the addition of the modifier without some form of conflict >> resolution. Most of the conflicts can even be resolved cleanly by >> lookahead. (/l isn't resolved cleanly by lookahead.) >> >> >> Instead of creating a syntax error, we deprecate in 5.14 not >> inserting a space between a pattern terminator and the following word. >> >> >> That still breaks backwards compatibility, and we'd have to wait for >> 5.016 to get /u and /l in. >> >> "use 5.014;" avoids both the break and the wait. It could either add /u >> and /l, or it could add the space requirement. >>
> > I don't think you understood my suggestion; everything would take effect > in 5.14. What I meant is that we could resolve things like the "unless' > by lookahead. That is we special case the l and u (and /a if we get it) > modifiers so that they don't take effect if the word they're in is a > legal one; the complete list of which you've given (I think). > > The algorithm would be: the code would look first for the 5.12 modifier > set, as currently. If that exhausts the word, continue as currently. > Otherwise if the word is one of the few you've mentioned, also continue > as currently, but raise the deprecated warning. Otherwise, reparse the > word, this time allowing the new modifiers. If that exhausts the word, > fine, we've got our modifiers. If not, raise the deprecated warning. A > syntax error would also be generated if the word isn't recognized. > > This would guarantee backward compatibility, with no inappropriate > syntax errors. /lt and /le would be resolved by documenting that these > have the 5.12 meanings. This would be lifted in 5.16 after the > deprecation cycle. > > I think the reasons to prefer my solution over yours is that it doesn't > require a 'use 5.014'; which I always forget to include, and I prefer > doing deprecation instead of syntax errors. (My first take is that > modifying the 'use 5.014' solution to do deprecation would be very > similar to taking my suggestion.) > > Otherwise, I'm fine with yours, except it requires me to learn a new > area of Perl in order to make the patch. :) > Does anyone else have an opinion?
I'm not fond of "we're now going to introduce a behaviour change - part
of it is to warn for another behaviour change in the next release".  It's
great if one's goal is to write code that outputs different things
depending on different versions (the more different things the better),
but in general, no.

But I like the idea of "see if the modifiers spell something else first".
Why not:

 - read the modifiers; if they can be considered a statement modifier or
   binary operator, do so (with the exception of the existing /x).
 - else, treat them as modifiers.

Do this for 5.14, 5.16, 5.18, etc.

If that means that '/foo/le;' will be parsed as '/foo/ le;', and then die
as a syntax error, then so be it.  /l doesn't exist yet, so one can always
learn to write '/foo/el;'.

Abigail
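A sketch of that rule in plain Perl follows; the word list is illustrative only, and the real decision would of course live in the lexer:

use strict;
use warnings;

# Words that can legitimately follow an expression as an operator or
# statement modifier (illustrative, not exhaustive).
my %OPERATOR_LIKE = map { $_ => 1 }
    qw( le lt ge gt eq ne cmp and or xor x
        if unless while until for foreach when );

sub classify_trailing_word {
    my ($word) = @_;
    return 'modifiers' if $word eq 'x';          # the existing /x exception
    return 'statement' if $OPERATOR_LIKE{$word}; # e.g. "le" => "/foo/ le"
    return 'modifiers';                          # e.g. "el", "ug"
}

print "$_: ", classify_trailing_word($_), "\n" for qw( x le el unless ug );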
CC: karl williamson <public [...] khwilliamson.com>, Eric Brine <ikegami [...] adaelis.com>, Jesse Vincent <jesse [...] fsck.com>, Ævar Arnfjörð <avarab [...] gmail.com>, Perl5 Porters <perl5-porters [...] perl.org>
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Thu, 10 Jun 2010 12:22:20 +0200
To: Abigail <abigail [...] abigail.be>
From: demerphq <demerphq [...] gmail.com>
On 8 June 2010 17:32, Abigail <abigail@abigail.be> wrote:
> On Wed, May 19, 2010 at 01:46:49PM -0600, karl williamson wrote:
>> Eric Brine wrote:
>>> On Wed, May 19, 2010 at 2:20 PM, karl williamson >>> <public@khwilliamson.com <mailto:public@khwilliamson.com>> wrote: >>> >>>     I don't understand these preclusions.  Why, for example, does the >>>     existence of 'and' preclude /a, but the existence of 'unless' not >>>     preclude /u ? >>> >>> >>> When used as a statement modifier. >>> >>> $ perl -le'$x+=/foo/unless$c; print "ok"' >>> ok >>> >>> If we add /u, the above would die as follows: >>> >>> Bareword found where operator expected at -e line 1, near "/foo/unless" >>>         (Missing operator before nless?) >>> syntax error at -e line 1, near "/foo/unless" >>> Execution of -e aborted due to compilation errors. >>> >>> "Preclude" is not quite the right word, at least not on its own. They >>> preclude the addition of the modifier without some form of conflict >>> resolution. Most of the conflicts can even be resolved cleanly by >>> lookahead. (/l isn't resolved cleanly by lookahead.) >>> >>> >>>     Instead of creating a syntax error, we deprecate in 5.14 not >>>     inserting a space between a pattern terminator and the following word. >>> >>> >>> That still breaks backwards compatibility, and we'd have to wait for >>> 5.016 to get /u and /l in. >>> >>> "use 5.014;" avoids both the break and the wait. It could either add /u >>> and /l, or it could add the space requirement. >>>
>> >> I don't think you understood my suggestion; everything would take effect >> in 5.14.  What I meant is that we could resolve things like the "unless' >> by lookahead.  That is we special case the l and u (and /a if we get it) >> modifiers so that they don't take effect if the word they're in is a >> legal one; the complete list of which you've given (I think). >> >> The algorithm would be: the code would look first for the 5.12 modifier >> set, as currently.  If that exhausts the word, continue as currently. >> Otherwise if the word is one of the few you've mentioned, also continue >> as currently, but raise the deprecated warning.  Otherwise, reparse the >> word, this time allowing the new modifiers.  If that exhausts the word, >> fine, we've got our modifiers.  If not, raise the deprecated warning.  A >> syntax error would also be generated if the word isn't recognized. >> >> This would guarantee backward compatibility, with no inappropriate >> syntax errors.  /lt and /le would be resolved by documenting that these >> have the 5.12 meanings.  This would be lifted in 5.16 after the >> deprecation cycle. >> >> I think the reasons to prefer my solution over yours is that it doesn't >> require a 'use 5.014'; which I always forget to include, and I prefer >> doing deprecation instead of syntax errors.  (My first take is that >> modifying the 'use 5.014' solution to do deprecation would be very >> similar to taking my suggestion.) >> >> Otherwise, I'm fine with yours, except it requires me to learn a new >> area of Perl in order to make the patch. :) >> Does anyone else have an opinion?
> > > I'm not fond of "we're now going to introduce a behaviour change - part > of it is to warn for another behaviour change in the next release".
I don't see any other choice given our deprecation policies. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
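A rough model of the lookahead Karl describes above, written as standalone Perl rather than real toke.c code; the modifier letters and keyword list are assumptions for illustration, not the actual lexer tables:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Try the 5.12-era modifier set first, then the statement keywords that
    # may legally follow a pattern, then the extended set with /l and /u.
    my $OLD = 'msixpogce';                         # assumed 5.12 flag letters
    my $NEW = $OLD . 'lu';                         # plus the proposed additions
    my @KEYWORDS = qw(and or if unless while until for foreach);

    sub classify {
        my ($word) = @_;
        return 'old modifiers: proceed as in 5.12'
            if $word =~ /\A[$OLD]+\z/;
        return 'keyword: proceed as in 5.12, raise deprecation warning'
            if grep { $_ eq $word } @KEYWORDS;
        return 'new modifiers accepted'
            if $word =~ /\A[$NEW]+\z/;
        return 'syntax error';
    }

    print "$_ => ", classify($_), "\n" for qw(gix unless glu bogus);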
CC: Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH: [perl #58182] partial, "The Unicode Bug". Add unicode semantics for \s, \w
Date: Tue, 22 Jun 2010 20:12:44 -0600
To: Paul LeoNerd Evans <leonerd [...] leonerd.org.uk>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 1.1k
Paul LeoNerd Evans wrote: Show quoted text
> On Thu, May 20, 2010 at 01:03:36PM +0100, Ben Morrow wrote:
>> Since there aren't any upper-case modifiers yet, it would be possible to >> introduce the rule 'Upper-case modifiers take a single-character >> argument' at this point. This would give a syntax like >> >> m/unicode /Uugx; >> m/locale /Ulgx; >> m/traditional/Utgx; >> >> which seems at least as clear as three or four random letters that >> happen to be mutually-exclusive.
> > Oh, now I do like that. It has all the neatness of my suggestion, but > with the added bonus that now we know that capitals take a sub-flag, > which means we should look at the next letter afterwards. > > Perhaps we could establish this as a general precedent? I have no idea > what this means > > m/foo/AzTlxMvg > > but at least I can parse it as A=z, T=l, x=true, M=v, g=true >
I think we should go with Ben's idea. I had suggested using M for mode; other suggestions were L, apparently for locale, and I for internationalization. I'm now thinking C for character set:

    Cu for Unicode
    Cl for current locale
    Ct for traditional, or Cm for mixed, which I think is clearer.

Later there could be Ca for ASCII or Cp for Posix.
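For later readers: the spelling that eventually shipped in Perl 5.14 kept lowercase single-letter charset modifiers, /d, /l, /u and /a, rather than a capital letter taking a sub-flag. A one-line check of /u on a 5.14-or-newer perl (this is the behaviour the whole ticket is about):

    use strict;
    use warnings;

    # /u requests Unicode (Latin-1) semantics regardless of how the string
    # happens to be stored internally.
    print "\xE0" =~ /\x{C0}/iu ? "match\n" : "no match\n";   # match on 5.14+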
Subject: PATCH [perl #58182] partial: user-defined casing
Date: Sat, 10 Jul 2010 16:14:58 -0600
To: Perl5 Porters <perl5-porters [...] perl.org>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 4.4k
Since my recent posts have been quickly answered, I thought that now would be a good time to push this warnocked item through as a real patch that I want to be considered, and picked apart. As Jesse says, a real patch can get people to come out of the woodwork. This is my best effort with my current knowledge of the perl core to solve this problem, but it deals in areas that I don't know much about.

make regen required

Below is a copy of the main commit message. I have noted my uncertainties. The patch is also pushed to git://github.com/khwilliamson/perl.git branch user

The attached patch removes another component of the "Unicode bug", in which semantics is affected by the internal storage state of being in utf8 or not. (Three components are left after this, all dealing with regexes.) This part causes user-defined casing to not be dependent on utf8ness.

The patch works by looking for three magical subroutine names while parsing in toke.c. It has long been the case that if a subroutine with one of these names exists in scope, then changing the case of a utf8 string will trigger the corresponding routine to be called. The patch causes the lexer to set a corresponding global variable if it sees any of these names. If there is a better way to do it, please let me know. The globals are actually on a per-interpreter basis, though I think they could have been real globals.

Prior to this patch, all the core case-changing functions in pp.c assumed that any utf8 string could have a user override, and so they go out to the disk tables for every character in a utf8 string (which automatically returns any overrides). But the tables for the first 256 Unicode characters are already compiled into the core, and these have been ignored for utf8 strings.

This patch causes those functions to look at their corresponding global. If that global is false, the function knows for sure that there is no user-defined case changing, and can use the built-in core tables for the first 256 characters. Thus this patch will significantly speed up case changing for utf8 strings that have significant quantities of Latin1 characters without user-overrides.

Note to reviewers: The code to do this table lookup has been in pp.c, but #ifdef'd out. The diffs will show just the #if's removed, and not the code that suddenly has become active; it's probable that that code was not looked at thoroughly by previous reviewers.

If the global is true, the case-changing function knows that somewhere in the program there has been defined a user case-change override. It doesn't know if the current call is in the scope of one. So, it uses gv_fetchmeth(PL_curstash) to try to find it, without autoload. I believe that that is the correct thing to do. If it doesn't find it, then it operates as if the global were not set, and again doesn't have to go out to the disk tables for the first 256 Unicode characters.

If it turns out that there is a user override case-changing function in effect for this call, the function creates a mortal copy of the input, and forces it into utf8. The user case change then gets applied as it would today on any utf8 string. I don't know if I have to do anything special concerning magic on the mortal copy, so I don't. I don't remove the mortal specially, as I believe that is done automatically. The output will always currently be in utf8 if the user case-change function gets called.
This behavior could be changed so that if the input was converted to utf8, the result would be examined to see if it needs to be in utf8, and if not, converted back. I'd rather not do that.

There is no change in behavior for programs that don't define any of the three magic subroutines, except for a possible speed-up, as outlined above. Programs that do define any of those functions, but are ignorant of their magical meaning and have never had a utf8 string that a case-changing function is called on, would break. Also, programs that rely on the semantic difference based on the internal representation would break. There would be no change in behavior for programs that use any of the functions for casing and make sure that all their strings are utf8. Programs that forgot this last step would suddenly start to work properly.

I view this as a bug fix that should get fixed. If there is a concern about backward compatibility, this could be controlled by feature.pm; 'unicode_strings' would probably work, though I don't think its name really fits this purpose.
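For reviewers who have not used the feature being optimized, a minimal sketch of the pre-5.16 user-defined casing interface; the a-to-Greek mapping is made up purely for illustration, and the mechanism only has an effect on perls of that era (it was removed in 5.16 in favour of Unicode::Casing):

    #!/usr/bin/perl
    use strict;
    use warnings;
    binmode STDOUT, ':encoding(UTF-8)';

    # Magic name consulted by the case-changing functions.  It returns extra
    # mapping lines in the same "first \t last \t mapping-of-first" format as
    # lib/unicore/To/Upper.pl; here a..c map to Greek Alpha..Gamma.
    sub ToUpper {
        return "0061\t0063\t0391\n";
    }

    my $s = "abc";
    utf8::upgrade($s);     # pre-patch, the override is consulted only for
                           # strings stored internally as UTF-8
    print uc($s), "\n";    # on an affected perl: Greek capitals, not "ABC"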

Message body is not shown because it is too large.

From d37e9445b468297a2e9e265d82d24a33b72ce270 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@khw-desktop.(none)>
Date: Sat, 10 Jul 2010 16:09:52 -0600
Subject: [PATCH] mktables: Add caution comments to output tables

These comments warn that changing the tables won't affect what Perl does
for the first 256 code points, as their behavior is compiled into the core
---
 lib/unicore/mktables |   16 +++++++++++++++-
 1 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/lib/unicore/mktables b/lib/unicore/mktables
index a113114..f7fe3d1 100644
--- a/lib/unicore/mktables
+++ b/lib/unicore/mktables
@@ -7821,21 +7821,35 @@ sub finish_property_setup {
     # Perl adds this alias.
     $gc->add_alias('Category');
 
+    my $core_usage = <<END;
+Note that although the Perl core uses this file, it has the standard values for
+code points from U+0000 to U+00FF compiled in, so changing this table will not
+change the core's behavior with respect to these code points.
+END
+    my $override = <<END;
+Use user-defined case-mappings (described in perlunicode.pod) to
+override this table
+END
+
     # For backwards compatibility, these property files have particular names.
     my $upper = property_ref('Uppercase_Mapping');
     $upper->set_core_access('uc()');
     $upper->set_file('Upper');    # This is what utf8.c calls it
+    $upper->add_comment($core_usage . $override);
 
     my $lower = property_ref('Lowercase_Mapping');
     $lower->set_core_access('lc()');
     $lower->set_file('Lower');
+    $lower->add_comment($core_usage . $override);
 
     my $title = property_ref('Titlecase_Mapping');
     $title->set_core_access('ucfirst()');
     $title->set_file('Title');
+    $title->add_comment($core_usage . $override);
 
     my $fold = property_ref('Case_Folding');
-    $fold->set_file('Fold') if defined $fold;
+    $fold->set_file('Fold');
+    $fold->add_comment($core_usage);
 
     # utf8.c can't currently cope with non range-size-1 for these, and even if
     # it were changed to do so, someone else may be using them, expecting the
-- 
1.5.6.3
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Sun, 11 Jul 2010 01:14:48 +0100
To: public [...] khwilliamson.com, perl5-porters [...] perl.org
From: Ben Morrow <ben [...] morrow.me.uk>
Download (untitled) / with headers
text/plain 4.2k
Quoth public@khwilliamson.com (karl williamson): Show quoted text
> > Since my recent posts have been quickly answered, I thought that now > would be a good time best to push this warnocked item through as a real > patch that I want to be considered, and picked apart. As Jesse says, a > real patch can get people to come out of the woodwork. This is my best > effort with my current knowledge of the perl core to solve this problem, > but it deals in areas that I don't know much about.
I should note first that I am certainly no expert here either, these are just things that stuck out. Show quoted text
> This patch causes those functions to look at their corresponding global. > If that global is false, the function knows for sure that there is no > user-defined case changing, and can use the built-in core tables for the > first 256 characters. Thus this patch will significantly speed up case > changing for utf8 strings that have significant quantities of Latin1 > characters without user-overrides.
This worries me, a little. It seems dangerously close to reintroducing PL_sawampersand, with the associated 'DO NOT USE $&' that came a little later. Show quoted text
> If the global is true, the case changing function knows that somewhere > in the program there has been defined a user case change override. It > doesn't know if the current call is in the scope of for it. So, it uses > gv_fetchmeth(PL_curstash) to try to find it, without autoload. I > believe that that is the correct thing to do.
I'm not sure. First, the current documentation says these ToFoo subs are only valid in the main:: package. I don't know if that corresponds to current reality or not. Second, using gv_fetchmeth means that these will be looked for as methods of the current package (allowing inheritance), which I don't think is correct--at least, not unless they are also going to be called as methods. I believe the correct way to get the documented behaviour is to call get_cvs("main::ToUpper", 0). However, global overrides like this are very hard to use in a way which doesn't interfere with everything else in the program, so perhaps looking in the current package is better. (I feel I ought to note that 'ToUpper' isn't all uppercase, and if we're going to start stealing random sub names in random packages it would be better to stick to that convention; it may of course be considered too late to change that now.) This would potentially provide a way around the 'trapdoor' flags, as well. If the GVs for main::ToFoo were kept in interpreter globals (like, say, the GV for *_ is) it would make checking for the CV a cheap operation. This would remove the need for the flag. I'm not sure what the best answer is, long-term. My ideal would be for these subs to be installed somewhere with lexical scope, but since %^H doesn't allow refs this is a little tricky. This would also raise the question of an appropriate interface for installing them: something like use casefold ToUpper => sub {...}, ToLower => sub {...}; would be necessary, which is certainly quite different from the current interface. <obligatory plug for my lexical-scoping patch, sorry...> Show quoted text
> +/* If user has defined functions that override normal case mappings */
> +PERLVARI(Ihas_user_defined_uc, bool, FALSE)
> +PERLVARI(Ihas_user_defined_lc, bool, FALSE)
> +PERLVARI(Ihas_user_defined_tc, bool, FALSE)
Is there some good reason not to make this a bitfield instead? It seems silly to use three ints where one would do. Show quoted text
> @@ -7606,12 +7607,52 @@ Perl_yylex(pTHX)
>      if (PL_madskills)
>          nametoke = newSVpvn(s, d - s);
>  #endif
> -    if (memchr(tmpbuf, ':', len))
> +    if (memchr(tmpbuf, ':', len)) {
> +        const char* const colon = ":";
>          sv_setpvn(PL_subname, tmpbuf, len);
<snip> Show quoted text
> +
> +
> +        /* The three subroutines 'ToUpper', 'ToLower', and
> +         * 'ToTitle' are special. Set a flag for each one we've
> +         * seen */
> +        if (len == 7 && base_subname[0] == 'T') {
Doing this here means that only an explicit 'sub ToUpper' will be recognized: something like *ToUpper = sub {...}; will end up ignored (unless the flag is set already for some reason). As a Perl programmer I would find this *very* confusing. I don't know where the best place to put the check is, though: perhaps gv_fetchpvn_flags, which already catches accesses to a whole lot of specially-named globs? Ben
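To make the point concrete, a small sketch of the two definition styles; the mapping strings are placeholders in the usual table format, and only the first style is visible to a parse-time check like the one in the patch:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Style 1: a named sub definition.  The lexer sees the name "ToLower"
    # while compiling, so a toke.c hook can set its flag there.
    sub ToLower {
        return "0041\t0043\t0061\n";    # A..C => a..c, placeholder mapping
    }

    # Style 2: a glob assignment.  To the lexer this is just an ordinary
    # assignment; the name "ToUpper" is only bound at run time, so a flag
    # set during parsing never fires for it.
    *ToUpper = sub {
        return "0061\t0063\t0041\n";    # a..c => A..C, placeholder mapping
    };

    print defined &ToUpper ? "ToUpper installed at run time\n" : "missing\n";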
CC: perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Sat, 10 Jul 2010 22:27:17 -0600
To: Ben Morrow <ben [...] morrow.me.uk>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 5.9k
Ben Morrow wrote: Show quoted text
> Quoth public@khwilliamson.com (karl williamson):
>> Since my recent posts have been quickly answered, I thought that now >> would be a good time best to push this warnocked item through as a real >> patch that I want to be considered, and picked apart. As Jesse says, a >> real patch can get people to come out of the woodwork. This is my best >> effort with my current knowledge of the perl core to solve this problem, >> but it deals in areas that I don't know much about.
> > I should note first that I am certainly no expert here either, these are > just things that stuck out.
Thanks for your informative response. Just a couple things that I can respond to now Show quoted text
>
>> This patch causes those functions to look at their corresponding global. >> If that global is false, the function knows for sure that there is no >> user-defined case changing, and can use the built-in core tables for the >> first 256 characters. Thus this patch will significantly speed up case >> changing for utf8 strings that have significant quantities of Latin1 >> characters without user-overrides.
> > This worries me, a little. It seems dangerously close to reintroducing > PL_sawampersand, with the associated 'DO NOT USE $&' that came a little > later.
Except that currently the penalty is already there, and this removes it for 99.999...%. I doubt that this construct is often used, and when it is I imagine that there's no way around it, so people will have to pay the penalty. Show quoted text
>
>> If the global is true, the case changing function knows that somewhere >> in the program there has been defined a user case change override. It >> doesn't know if the current call is in the scope of for it. So, it uses >> gv_fetchmeth(PL_curstash) to try to find it, without autoload. I >> believe that that is the correct thing to do.
> > I'm not sure. First, the current documentation says these ToFoo subs are > only valid in the main:: package. I don't know if that corresponds to > current reality or not.
You're looking at old documentation. I changed it in 5.12 to correspond to the reality that they are package level. I don't know how that affects your get_cvs() recommendation below. Show quoted text
> Second, using gv_fetchmeth means that these will
> be looked for as methods of the current package (allowing inheritance), > which I don't think is correct--at least, not unless they are also going > to be called as methods. > > I believe the correct way to get the documented behaviour is to call > get_cvs("main::ToUpper", 0). However, global overrides like this are > very hard to use in a way which doesn't interfere with everything else > in the program, so perhaps looking in the current package is better. (I > feel I ought to note that 'ToUpper' isn't all uppercase, and if we're > going to start stealing random sub names in random packages it would be > better to stick to that convention; it may of course be considered too > late to change that now.)
Since these names have been out there for quite some years, I think it's too late. Show quoted text
> > This would potentially provide a way around the 'trapdoor' flags, as > well. If the GVs for main::ToFoo were kept in interpreter globals (like, > say, the GV for *_ is) it would make checking for the CV a cheap > operation. This would remove the need for the flag. > > I'm not sure what the best answer is, long-term. My ideal would be for > these subs to be installed somewhere with lexical scope, but since %^H > doesn't allow refs this is a little tricky. This would also raise the > question of an appropriate interface for installing them: something like > > use casefold > ToUpper => sub {...}, > ToLower => sub {...}; > > would be necessary, which is certainly quite different from the current > interface. > > <obligatory plug for my lexical-scoping patch, sorry...> >
>> +/* If user has defined functions that override normal case mappings */
>> +PERLVARI(Ihas_user_defined_uc, bool, FALSE)
>> +PERLVARI(Ihas_user_defined_lc, bool, FALSE)
>> +PERLVARI(Ihas_user_defined_tc, bool, FALSE)
> > Is there some good reason not to make this a bitfield instead? It seems > silly to use three ints where one would do.
I was just trying to make it as fast as possible. No other reason. Show quoted text
>
>> @@ -7606,12 +7607,52 @@ Perl_yylex(pTHX)
>>      if (PL_madskills)
>>          nametoke = newSVpvn(s, d - s);
>>  #endif
>> -    if (memchr(tmpbuf, ':', len))
>> +    if (memchr(tmpbuf, ':', len)) {
>> +        const char* const colon = ":";
>>          sv_setpvn(PL_subname, tmpbuf, len);
> <snip>
>> +
>> +
>> +        /* The three subroutines 'ToUpper', 'ToLower', and
>> +         * 'ToTitle' are special. Set a flag for each one we've
>> +         * seen */
>> +        if (len == 7 && base_subname[0] == 'T') {
> > Doing this here means that only an explicit 'sub ToUpper' will be > recognized: something like > > *ToUpper = sub {...};
Good point. Show quoted text
> > will end up ignored (unless the flag is set already for some reason). As > a Perl programmer I would find this *very* confusing. I don't know where > the best place to put the check is, though: perhaps gv_fetchpvn_flags, > which already catches accesses to a whole lot of specially-named globs?
I hope someone knows who reads this. Show quoted text
> > Ben >
I think this language feature was ill-advised, but it is there and has been there for a long time, and I think therefore we have to support it. It is useful in the Turkish and Azeri languages due to the Unicode standard being very problematic (to be euphemistic about it) in those areas. I don't know where else it is useful. Because of its current implementation, all utf8 casing suffers a performance penalty, which this patch alleviates.

I don't think we need to go to a lot of trouble inventing new mechanisms to support this. And there is a work-around, which I only recently learned enough Perl to know about: install your own uc, lc, ... functions globally, and then you can do anything you want.

Thanks for educating me a little on the core, and hopefully I'll get some more feedback.
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Sun, 11 Jul 2010 12:45:12 +0100
To: public [...] khwilliamson.com, perl5-porters [...] perl.org
From: Ben Morrow <ben [...] morrow.me.uk>
Download (untitled) / with headers
text/plain 4.2k
Quoth public@khwilliamson.com (karl williamson): Show quoted text
> Ben Morrow wrote:
> > Quoth public@khwilliamson.com (karl williamson): > >
> >> This patch causes those functions to look at their corresponding global. > >> If that global is false, the function knows for sure that there is no > >> user-defined case changing, and can use the built-in core tables for the > >> first 256 characters. Thus this patch will significantly speed up case > >> changing for utf8 strings that have significant quantities of Latin1 > >> characters without user-overrides.
> > > > This worries me, a little. It seems dangerously close to reintroducing > > PL_sawampersand, with the associated 'DO NOT USE $&' that came a little > > later.
> > Except that currently the penalty is already there, and this removes it > for 99.999...%. I doubt that this construct is often used, and when it > is I imagine that there's no way around it, so people will have to pay > the penalty.
OK. Show quoted text
> >> If the global is true, the case changing function knows that somewhere > >> in the program there has been defined a user case change override. It > >> doesn't know if the current call is in the scope of for it. So, it uses > >> gv_fetchmeth(PL_curstash) to try to find it, without autoload. I > >> believe that that is the correct thing to do.
> > > > I'm not sure. First, the current documentation says these ToFoo subs are > > only valid in the main:: package. I don't know if that corresponds to > > current reality or not.
> > You're looking at old documentation. I changed it in 5.12 to correspond > to the reality that they are package level. I don't know how that > affects your get_cvs() recommendation below.
OK, sorry. In that case I think you want

    GV *gv;
    CV *cv;

    if (   (gv = gv_fetchpvs("ToUpper", GV_NOADD_NOINIT|GV_NOTQUAL))
        && isGV_with_GP(gv)
        && (cv = GvCV(gv))
    ) {
        /* we have a sub to call */
    }

Show quoted text
> > in the program, so perhaps looking in the current package is better. (I > > feel I ought to note that 'ToUpper' isn't all uppercase, and if we're > > going to start stealing random sub names in random packages it would be > > better to stick to that convention; it may of course be considered too > > late to change that now.)
> > Since these names have been out there for quite some years, I think it's > too late.
OK. Show quoted text
> > [toke.c] > > Doing this here means that only an explicit 'sub ToUpper' will be > > recognized: something like > > > > *ToUpper = sub {...};
> > Good point.
> > > > will end up ignored (unless the flag is set already for some reason). As > > a Perl programmer I would find this *very* confusing. I don't know where > > the best place to put the check is, though: perhaps gv_fetchpvn_flags, > > which already catches accesses to a whole lot of specially-named globs?
> > I hope someone knows who reads this.
Looking at the code again, gv_fetchpvn_flags is where PL_sawampersand gets set, so I think it *is* the right place. Show quoted text
> I think this language feature was ill-advised, but it is there and has > been there for a long time, and I think therefore we have to support it. > It is useful in the Turkish and Azerii languages due to the Unicode > standard being very problematic (to be euphemistic about it) in those > areas. I don't know where else it is useful.
I don't know. It seems to me like the sort of thing that could be very useful under the right circumstances, but which I would never dare to use for fear of breaking something else in the program. A lexically- scoped alternative would, I feel, be the right answer. However, if you want to say you are only going to fix existing bugs, not introduce new features, that's a perfectly sensible restriction of scope :). Show quoted text
> Because of its current implementation, all utf8 casing suffers a > performance penalty which this patch alleviates. > > I don't think we need to go to a lot of trouble, inventing new > mechanisms to support this. And there is a work-around, which I only > recently learned enough Perl to know about. And that is to install your > own uc, lc, ... functions globally and then you can do anything you want.
AFAIK that only works for explicit calls to lc and uc, though. This casefold mechanism also applies to "\L" and (I presume) to the case-smashing done by m//i. Ben
CC: public [...] khwilliamson.com, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Sun, 11 Jul 2010 14:11:36 +0200
To: Ben Morrow <ben [...] morrow.me.uk>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 635b
On 11 July 2010 13:45, Ben Morrow <ben@morrow.me.uk> wrote: Show quoted text
> Quoth public@khwilliamson.com (karl williamson):
>> Ben Morrow wrote:
>> > Quoth public@khwilliamson.com (karl williamson):
> AFAIK that only works for explicit calls to lc and uc, though. This > casefold mechanism also applies to "\L" and (I presume) to the > case-smashing done by m//i.
I'm guessing it doesn't apply to m//i. Case-insensitive matching doesn't use "uppercase" and "lowercase" forms (which are less than useful in many writing systems); it uses the "foldcase" form, which is quite different. cheers, Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Sun, 11 Jul 2010 07:39:22 -0600
To: demerphq <demerphq [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 725b
demerphq wrote: Show quoted text
> On 11 July 2010 13:45, Ben Morrow <ben@morrow.me.uk> wrote:
>> Quoth public@khwilliamson.com (karl williamson):
>>> Ben Morrow wrote:
>>>> Quoth public@khwilliamson.com (karl williamson):
>> AFAIK that only works for explicit calls to lc and uc, though. This >> casefold mechanism also applies to "\L" and (I presume) to the >> case-smashing done by m//i.
> > Im guessing it doesnt apply to m//i.
Your guess is correct. And \L and cousins do call the same routines as lc and its cousins do. Show quoted text
> > Case insensitive matching doesnt use "uppercase" and "lowercase" forms > (which are less than useful in many writing systems), it uses > "foldcase" form, which is quite different. > > cheers, > Yves > >
CC: Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Sun, 11 Jul 2010 07:47:35 -0600
To: demerphq <demerphq [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 879b
karl williamson wrote: Show quoted text
> demerphq wrote:
>> On 11 July 2010 13:45, Ben Morrow <ben@morrow.me.uk> wrote:
>>> Quoth public@khwilliamson.com (karl williamson):
>>>> Ben Morrow wrote:
>>>>> Quoth public@khwilliamson.com (karl williamson):
>>> AFAIK that only works for explicit calls to lc and uc, though. This >>> casefold mechanism also applies to "\L"
Oops, I didn't catch your drift. I'll have to check into that, to fix the documentation that's in 5.13. Show quoted text
>>> and (I presume) to the
>>> case-smashing done by m//i.
>> >> Im guessing it doesnt apply to m//i.
> > Your guess is correct. And \L and cousins do call the same routines as > lc and its cousins do.
>> >> Case insensitive matching doesnt use "uppercase" and "lowercase" forms >> (which are less than useful in many writing systems), it uses >> "foldcase" form, which is quite different. >> >> cheers, >> Yves >> >>
>
CC: Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Sun, 11 Jul 2010 22:01:03 -0400
To: karl williamson <public [...] khwilliamson.com>
From: David Golden <xdaveg [...] gmail.com>
On Sun, Jul 11, 2010 at 12:27 AM, karl williamson <public@khwilliamson.com> wrote: Show quoted text
>> I believe the correct way to get the documented behaviour is to call >> get_cvs("main::ToUpper", 0). However, global overrides like this are >> very hard to use in a way which doesn't interfere with everything else >> in the program, so perhaps looking in the current package is better. (I >> feel I ought to note that 'ToUpper' isn't all uppercase, and if we're >> going to start stealing random sub names in random packages it would be >> better to stick to that convention; it may of course be considered too >> late to change that now.)
> > Since these names have been out there for quite some years, I think it's too > late.
Why not deprecate it and introduce something with lexical scoping? If we're going to fix something that was poorly designed/considered in the first place, let's see about going all the way and coming up with a better design. (cue the caution not to let the "perfect" solution stand in the way of a "good" solution) -- David
CC: Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Mon, 12 Jul 2010 18:11:23 -0600
To: David Golden <xdaveg [...] gmail.com>
From: Karl Williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 2.1k
David Golden wrote: Show quoted text
> On Sun, Jul 11, 2010 at 12:27 AM, karl williamson > <public@khwilliamson.com> wrote:
>>> I believe the correct way to get the documented behaviour is to call >>> get_cvs("main::ToUpper", 0). However, global overrides like this are >>> very hard to use in a way which doesn't interfere with everything else >>> in the program, so perhaps looking in the current package is better. (I >>> feel I ought to note that 'ToUpper' isn't all uppercase, and if we're >>> going to start stealing random sub names in random packages it would be >>> better to stick to that convention; it may of course be considered too >>> late to change that now.)
>> Since these names have been out there for quite some years, I think it's too >> late.
> > Why not deprecate it and introduce something with lexical scoping? > > If we're going to fix something that was poorly designed/considered in > the first way, let's see about going all the way and coming up with a > better design. (cue the caution not to let the "perfect" solution > stand in the way of a "good" solution) >
I agree, but wonder how much work it is going to be. I don't understand very well the implications of Ben's patch that just got applied. Does it help make this easier, along the lines he outlined in his first email?

And, until the existing behavior is removed, or something like my patch is applied, all utf8 strings pay the performance penalty.

Ben's right about \L not being affected by overriding lc(). I expect there is no way to do this, which means context-sensitive casing is not fully available. And context-sensitive casing is problematic anyway, because one never knows if one got sufficient context to make the right choice, or whether concatenating or taking substrings will later turn what is the right choice now into the wrong choice. In the case of Turkish, the context sensitivity applies only to modifier characters, which should never be split apart from their base character anyway. Another use I can think of for the facility is to have cased private-use characters.

Anyway, I took Ben's suggestions, and they all work, so I could submit another patch with them, but will wait to see what responses this post gets.
CC: Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Mon, 12 Jul 2010 22:17:19 -0400
To: Karl Williamson <public [...] khwilliamson.com>
From: David Golden <xdaveg [...] gmail.com>
Download (untitled) / with headers
text/plain 778b
On Mon, Jul 12, 2010 at 8:11 PM, Karl Williamson <public@khwilliamson.com> wrote: Show quoted text
>> If we're going to fix something that was poorly designed/considered in >> the first way, let's see about going all the way and coming up with a >> better design.  (cue the caution not to let the "perfect" solution >> stand in the way of a "good" solution) >>
> > I agree, but wonder how much work is it going to be.  I don't understand the > implications very well of Ben's patch that just got applied.  Does it help > make this easier, along the lines he outlined in his first email?
Or let me make another suggestion -- move it to something like CORE::GLOBAL::blah so at least people realize that they're mucking with a global and have the option to localize it or something. -- David
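A minimal sketch of the CORE::GLOBAL workaround Karl and David are referring to, with a toy Latin-1 casing rule standing in for anything real; note (per Ben above) that only explicit uc() calls see it, not \U or m//i:

    #!/usr/bin/perl
    use strict;
    use warnings;

    BEGIN {
        no warnings 'once';
        # Replace uc() program-wide for all code compiled after this point.
        *CORE::GLOBAL::uc = sub {
            my $s = shift;
            $s =~ tr/a-z\xE0-\xFE/A-Z\xC0-\xDE/;   # toy rule, demo only
            return $s;
        };
    }

    print uc("caf\xE9"), "\n";   # "CAF\xC9" via the override, even though the
                                 # string is a plain byte string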
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Tue, 13 Jul 2010 22:07:08 +0100
To: public [...] khwilliamson.com, perl5-porters [...] perl.org
From: Ben Morrow <ben [...] morrow.me.uk>
Download (untitled) / with headers
text/plain 3.4k
Quoth public@khwilliamson.com (Karl Williamson): Show quoted text
> David Golden wrote: >
[ main::ToUpper &c. ] Show quoted text
> > > > Why not deprecate it and introduce something with lexical scoping? > > > > If we're going to fix something that was poorly designed/considered in > > the first way, let's see about going all the way and coming up with a > > better design. (cue the caution not to let the "perfect" solution > > stand in the way of a "good" solution) > >
> > I agree, but wonder how much work is it going to be. I don't understand > the implications very well of Ben's patch that just got applied. Does > it help make this easier, along the lines he outlined in his first email?
First, I would hate for this discussion of a potential new feature to distract you from the work you're currently doing. The patch you submitted is a definite improvement on what was there before, and IMHO it (or something like it) should go in. Having looked a little more at implementing this lexically, it's not as straightforward as I'd thought (there's a surprise). My patch makes it easy to scope things lexically at *compile* time; this would need lexical scoping at *run* time, which is somewhat harder. One reason for this is that there are only a few ways to exit a scope at compile time, but many different ways at run time; another is that perl doesn't really understand about scopes smaller than a whole sub at run time. Show quoted text
> And, until the existing behavior is removed, or something like my patch > is applied, all utf8 strings pay the performance penalty. > > Ben's right about \L not being affected by overriding lc(). I expect > there is no way to do this, which means context-sensitive casing is not > fully available.
Overriding CORE::GLOBAL::readpipe seems to manage to correctly override the implementation of backticks. I'm not very familiar with that bit of toke.c any more, but it ought to be possible to treat \LUu the same way, and pick up a CORE::GLOBAL override at compile time. This would also reduce the problem of lexically scoping these overrides to a compile-time problem, which (as above) makes it much more tractable. Another point is that Yves said that m//i doesn't currently honour any of these overrides. If we're going to improve case-folding (rather than simply fixing bugs and optimising existing behaviour) it would certainly be helpful to have some way to override this. I don't know enough about what m//i does to know what would be required: would a fourth ToFold override be sufficient? Show quoted text
> And context-sensitive casing is problematic anyway, > because one never knows if one got sufficient context to make the right > choice, or if concatenating or taking substrings will later make the > right choice now, the wrong choice then. In the case of Turkish, the > context sensitivity applies only to modifier characters which should > never be split apart from their base character anyway.
I feel the important point to remember in this sort of discussion is 'Perl has polymorphic values, but monomorphic operators, since having both is too confusing'. IMHO most of the problems with Perl's Unicode support can be traced back to forgetting this design decision, and allowing the flags on the string to affect the operators (rather than the other way around). A properly-scoped lc override puts the onus on the user to know whether their strings should be treated as Turkish or not, just as they need to know whether they should be using == or eq. Ben
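A minimal demonstration of the compile-time CORE::GLOBAL interception of backticks that Ben mentions via readpipe; the replacement body is made up, and exact argument handling has varied a little between perl versions:

    #!/usr/bin/perl
    use strict;
    use warnings;

    BEGIN {
        no warnings 'once';
        # Installed before the backticks below are compiled, so they become
        # calls to this sub instead of running a shell.
        *CORE::GLOBAL::readpipe = sub {
            my $cmd = join ' ', @_;
            return "pretend output of '$cmd'\n";
        };
    }

    print `echo hello`;    # prints: pretend output of 'echo hello'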
CC: perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Tue, 13 Jul 2010 18:25:49 -0600
To: Ben Morrow <ben [...] morrow.me.uk>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 4.4k
Ben Morrow wrote: Show quoted text
> Quoth public@khwilliamson.com (Karl Williamson):
>> David Golden wrote: >>
> [ main::ToUpper &c. ]
>>> Why not deprecate it and introduce something with lexical scoping? >>> >>> If we're going to fix something that was poorly designed/considered in >>> the first way, let's see about going all the way and coming up with a >>> better design. (cue the caution not to let the "perfect" solution >>> stand in the way of a "good" solution) >>>
>> I agree, but wonder how much work is it going to be. I don't understand >> the implications very well of Ben's patch that just got applied. Does >> it help make this easier, along the lines he outlined in his first email?
> > First, I would hate for this discussion of a potential new feature to > distract you from the work you're currently doing. The patch you > submitted is a definite improvement on what was there before, and IMHO > it (or something like it) should go in. > > Having looked a little more at implementing this lexically, it's not as > straightforward as I'd thought (there's a surprise). My patch makes it > easy to scope things lexically at *compile* time; this would need > lexical scoping at *run* time, which is somewhat harder. One reason for > this is that there are only a few ways to exit a scope at compile time, > but many different ways at run time; another is that perl doesn't really > understand about scopes smaller than a whole sub at run time. >
>> And, until the existing behavior is removed, or something like my patch >> is applied, all utf8 strings pay the performance penalty. >> >> Ben's right about \L not being affected by overriding lc(). I expect >> there is no way to do this, which means context-sensitive casing is not >> fully available.
> > Overriding CORE::GLOBAL::readpipe seems to manage to correctly override > the implementation of backticks. I'm not very familiar with that bit of > toke.c any more, but it ought to be possible to treat \LUu the same way, > and pick up a CORE::GLOBAL override at compile time. This would also > reduce the problem of lexically scoping these overrides to a > compile-time problem, which (as above) makes it much more tractable.
I don't understand much about that at all, but I found the following in toke.c:

    if (*s == 'l')
        NEXTVAL_NEXTTOKE.ival = OP_LCFIRST;
    else if (*s == 'u')
        NEXTVAL_NEXTTOKE.ival = OP_UCFIRST;
    else if (*s == 'L')
        NEXTVAL_NEXTTOKE.ival = OP_LC;
    else if (*s == 'U')
        NEXTVAL_NEXTTOKE.ival = OP_UC;
    else if (*s == 'Q')
        NEXTVAL_NEXTTOKE.ival = OP_QUOTEMETA;

But my experiments apparently showed that this OP_LC is not somehow overridden by the user's version. Show quoted text
> > Another point is that Yves said that m//i doesn't currently honour any > of these overrides. If we're going to improve case-folding (rather than > simply fixing bugs and optimising existing behaviour) it would certainly > be helpful to have some way to override this. I don't know enough about > what m//i does to know what would be required: would a fourth ToFold > override be sufficient?
My experience is that the Unicode case-folding is the most broken part of Unicode handling. I gave up trying to write TODO test scripts because there were just too many things wrong. And they're hard to fix, as the design is flawed. Yves has said that he had a plan to redo the whole thing as a trie, and I let him know already that this kind of thing might be an issue, and he responded that he would keep that in mind. Show quoted text
>
>> And context-sensitive casing is problematic anyway, >> because one never knows if one got sufficient context to make the right >> choice, or if concatenating or taking substrings will later make the >> right choice now, the wrong choice then. In the case of Turkish, the >> context sensitivity applies only to modifier characters which should >> never be split apart from their base character anyway.
> > I feel the important point to remember in this sort of discussion is > 'Perl has polymorphic values, but monomorphic operators, since having > both is too confusing'. IMHO most of the problems with Perl's Unicode > support can be traced back to forgetting this design decision, and > allowing the flags on the string to affect the operators (rather than > the other way around). A properly-scoped lc override puts the onus on > the user to know whether their strings should be treated as Turkish or > not, just as they need to know whether they should be using == or eq. > > Ben >
CC: public [...] khwilliamson.com, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 09:26:03 +0200
To: Ben Morrow <ben [...] morrow.me.uk>
From: demerphq <demerphq [...] gmail.com>
On 13 July 2010 23:07, Ben Morrow <ben@morrow.me.uk> wrote: Show quoted text
> Quoth public@khwilliamson.com (Karl Williamson):
>> David Golden wrote: >>
> [ main::ToUpper &c. ]
>> > >> > Why not deprecate it and introduce something with lexical scoping? >> > >> > If we're going to fix something that was poorly designed/considered in >> > the first way, let's see about going all the way and coming up with a >> > better design.  (cue the caution not to let the "perfect" solution >> > stand in the way of a "good" solution) >> >
>> >> I agree, but wonder how much work is it going to be.  I don't understand >> the implications very well of Ben's patch that just got applied.  Does >> it help make this easier, along the lines he outlined in his first email?
> > First, I would hate for this discussion of a potential new feature to > distract you from the work you're currently doing. The patch you > submitted is a definite improvement on what was there before, and IMHO > it (or something like it) should go in. > > Having looked a little more at implementing this lexically, it's not as > straightforward as I'd thought (there's a surprise). My patch makes it > easy to scope things lexically at *compile* time; this would need > lexical scoping at *run* time, which is somewhat harder. One reason for > this is that there are only a few ways to exit a scope at compile time, > but many different ways at run time; another is that perl doesn't really > understand about scopes smaller than a whole sub at run time. >
>> And, until the existing behavior is removed, or something like my patch >> is applied, all utf8 strings pay the performance penalty. >> >> Ben's right about \L not being affected by overriding lc().  I expect >> there is no way to do this, which means context-sensitive casing is not >> fully available.
> > Overriding CORE::GLOBAL::readpipe seems to manage to correctly override > the implementation of backticks. I'm not very familiar with that bit of > toke.c any more, but it ought to be possible to treat \LUu the same way, > and pick up a CORE::GLOBAL override at compile time. This would also > reduce the problem of lexically scoping these overrides to a > compile-time problem, which (as above) makes it much more tractable. > > Another point is that Yves said that m//i doesn't currently honour any > of these overrides. If we're going to improve case-folding (rather than > simply fixing bugs and optimising existing behaviour) it would certainly > be helpful to have some way to override this. I don't know enough about > what m//i does to know what would be required: would a fourth ToFold > override be sufficient?
Fold case is basically the longest equivalent of a given character. So for instance the foldcase of \xDF is "ss", however arguably (the latest unicode version and apparently Austria might say differently) the uc of \xDF is \xDF and the lc of \xDF is also \xDF. So basically the interface would have to support casing from a single character to multiple. I'm assuming we already support this for uc/lc. I haven't dug, so forgive me if I ask a stupid question: do we support a distinct "titlecase" mode? Do we have primitives for it? Do we expose foldcasing? Do we expose support for canonicalization? I suspect we should have some new primitives, fc() (fold case) and cc() (canonical-case) and tc() (title case - uc != tc in many languages). Show quoted text
>> And context-sensitive casing is problematic anyway, >> because one never knows if one got sufficient context to make the right >> choice, or if concatenating or taking substrings will later make the >> right choice now, the wrong choice then.  In the case of Turkish, the >> context sensitivity applies only to modifier characters which should >> never be split apart from their base character anyway.
> > I feel the important point to remember in this sort of discussion is > 'Perl has polymorphic values, but monomorphic operators, since having > both is too confusing'. IMHO most of the problems with Perl's Unicode > support can be traced back to forgetting this design decision, and > allowing the flags on the string to affect the operators (rather than > the other way around).
If I read you right you are suggesting we should have added new operators for string comparison?

I'm trying to work out how this would work in practice. Let's suppose we only add one new operator for each operation, based on the cmp/<=> interface. So we would need

    bcmp  : compare two strings at the binary level irrespective of how
            they are encoded.
    ucmp  : compare two strings at the codepoint level
    iucmp : compare two strings case insensitively at the codepoint level
            using unicode semantics

Is this what you mean? Show quoted text
> A properly-scoped lc override puts the onus on > the user to know whether their strings should be treated as Turkish or > not, just as they need to know whether they should be using == or eq.
This makes sense, but is problematic at the level that the holy grail is to make "strings just work". However I think maybe recognizing that one cannot make "strings just work" is a good thing... Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
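The uc/lc/fold distinction Yves describes can be seen directly with the fc() builtin that later arrived in Perl 5.16, so this needs a 5.16-or-newer perl:

    use v5.16;                     # enables say, fc and unicode_strings
    binmode STDOUT, ':encoding(UTF-8)';

    my $sharp_s = "\xDF";          # LATIN SMALL LETTER SHARP S
    say "uc: ", uc $sharp_s;       # SS -- the multi-character uppercase
    say "lc: ", lc $sharp_s;       # the sharp s itself, unchanged
    say "fc: ", fc $sharp_s;       # ss -- the fold that /i-style matching needs
    say fc($sharp_s) eq fc("SS") ? "fold-equal" : "fold-different";   # fold-equal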
CC: Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 09:33:13 +0200
To: karl williamson <public [...] khwilliamson.com>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 1.7k
On 14 July 2010 02:25, karl williamson <public@khwilliamson.com> wrote: Show quoted text
> Ben Morrow wrote:
>> >> Quoth public@khwilliamson.com (Karl Williamson): >> Another point is that Yves said that m//i doesn't currently honour any >> of these overrides. If we're going to improve case-folding (rather than >> simply fixing bugs and optimising existing behaviour) it would certainly >> be helpful to have some way to override this. I don't know enough about >> what m//i does to know what would be required: would a fourth ToFold >> override be sufficient?
> > My experience is that the Unicode case-folding is the most broken part of > Unicode handling.  I gave up trying to write TODO test scripts because there > were just too many things wrong.  And they're hard to fix, as the design is > flawed.  Yves has said that he had a plan to redo the whole thing as a trie, > and I let him know already that this kind of thing might be an issue, and he > responded that he would keep that in mind.
My thoughts concerned performance only. My thinking was that there was a clever way to store the folding info so that we could improve unicode case insensitive matching. I started working on it, discovered a flaw in my plans which seemed like a show stopper, and I haven't returned to it. Lexically scoped casing rules would definitely not improve the performance issues. Mostly fold casing seems to make sense to me, except that it is inefficient; can you refresh me, and the list, as to the really problematic cases? Note that our implementation has issues related to how we model sequences of characters (for instance EXACT/ANYOF etc) but the underlying premise of fold-casing doesn't seem to me to be intrinsically broken.... cheers, yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 12:37:46 +0100
To: demerphq [...] gmail.com, perl5-porters [...] perl.org
From: Ben Morrow <ben [...] morrow.me.uk>
Download (untitled) / with headers
text/plain 3.4k
Quoth demerphq@gmail.com (demerphq): Show quoted text
> > Fold case is basically the longest equivalent of a given character. > > So for instance the foldcase of \xDF is "ss", however arguably (the > latest unicode version and apparently Austria might say differently) > the uc of \xDF is \xDF and the lc of \xDF is also \xDF. > > So basically the interface would have to support casing from a single > character to multiple. Im assuming we already suport this for uc/lc. > > I havent dug, and so forgive me if i ask a stupid question, do we > support a distinct "titlecase" mode? Do we have primitives for it? Do > we expose foldcasing? Do we expose support for canonicalization?
ucfirst and \u perform titlecasing for one character only (the rest of the string is lowercased). AFAIK that's the only interface. Show quoted text
> >> And context-sensitive casing is problematic anyway, > >> because one never knows if one got sufficient context to make the right > >> choice, or if concatenating or taking substrings will later make the > >> right choice now, the wrong choice then. In the case of Turkish, the > >> context sensitivity applies only to modifier characters which should > >> never be split apart from their base character anyway.
> > > > I feel the important point to remember in this sort of discussion is > > 'Perl has polymorphic values, but monomorphic operators, since having > > both is too confusing'. IMHO most of the problems with Perl's Unicode > > support can be traced back to forgetting this design decision, and > > allowing the flags on the string to affect the operators (rather than > > the other way around).
> > If i read you write you are suggesting we should have added new > operators for string comparison? > > Im trying to work out this would work in practice. Lets suppose we > only add one new operator for each operation, based on the cmp/<=> > interface.
At the time it was considered that a lexical pragma to switch from one version of cmp to another was the least bad solution. (I pretty-much agree, here.) So

    $a cmp $b       # coerces $a and $b to default semantics, whatever
                    # they may be (probably Unicode).

    { use bytes;    # coerces $a and $b to byte semantics. IMHO the fact
        $a cmp $b   # this has never worked is a bug: it should have
    }               # explicitly downgraded strings (including replacing
                    # characters >255).

Probably the existing pragmas cannot be reused for this, so something like

    use strings "unicode";
    use strings "bytes";
    use strings "locale";
    use strings casefold =>
        uc => sub { ... },
        lc => sub { ... };

would be necessary. Show quoted text
> So we would need > > bcmp : compare two strings at the binary level irrespective of how > they are encoded. > ucmp : compare two strings at the codepoint level > iucmp : compare two strings case insensitively at the codepoint level > using unicode semantics > > Is this what you mean?
Effectively, but without introducing a whole lot of new operator names. Show quoted text
> > A properly-scoped lc override puts the onus on > > the user to know whether their strings should be treated as Turkish or > > not, just as they need to know whether they should be using == or eq.
> > This makes sense, but is problematic at the level that the holy grail > is to make "strings just work". However I think maybe recognizing that > one cannot make "strings just work" is a good thing...
+1. Sane, comprehensible semantics are more use than an attempt to DWIW with bizarre side-effects. Ben
CC: perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 14:45:23 +0200
To: Ben Morrow <ben [...] morrow.me.uk>
From: demerphq <demerphq [...] gmail.com>
On 14 July 2010 13:37, Ben Morrow <ben@morrow.me.uk> wrote: Show quoted text
> Quoth demerphq@gmail.com (demerphq):
>> >> Fold case is basically the longest equivalent of a given character. >> >> So for instance the foldcase of \xDF is "ss", however arguably (the >> latest unicode version and apparently Austria might say differently) >> the uc of \xDF is \xDF and the lc of \xDF is also \xDF. >> >> So basically the interface would have to support casing from a single >> character to multiple. Im assuming we already suport this for uc/lc. >> >> I havent dug, and so forgive me if i ask a stupid question, do we >> support a distinct "titlecase" mode? Do we have primitives for it? Do >> we expose foldcasing? Do we expose support for canonicalization?
> > ucfirst and \u perform titlecasing for one character only (the rest of > the string is lowercased). AFAIK that's the only interface.
Ah, but that's the thing: they don't perform /titlecasing/, they provide "ucfirst"; it just happens to be the case that in most western European languages the two are the same. In many scripts they aren't. Show quoted text
>> >> And context-sensitive casing is problematic anyway, >> >> because one never knows if one got sufficient context to make the right >> >> choice, or if concatenating or taking substrings will later make the >> >> right choice now, the wrong choice then.  In the case of Turkish, the >> >> context sensitivity applies only to modifier characters which should >> >> never be split apart from their base character anyway.
>> > >> > I feel the important point to remember in this sort of discussion is >> > 'Perl has polymorphic values, but monomorphic operators, since having >> > both is too confusing'. IMHO most of the problems with Perl's Unicode >> > support can be traced back to forgetting this design decision, and >> > allowing the flags on the string to affect the operators (rather than >> > the other way around).
>> >> If i read you write you are suggesting we should have added new >> operators for string comparison? >> >> Im trying to work out this would work in practice. Lets suppose we >> only add one new operator for each operation, based on the cmp/<=> >> interface.
>
> At the time it was considered that a lexical pragma to switch from one
> version of cmp to another was the least bad solution. (I pretty-much
> agree, here.) So
>
>    $a cmp $b       # coerces $a and $b to default semantics, whatever
>                    # they may be (probably Unicode).
>
>    { use bytes;    # coerces $a and $b to byte semantics. IMHO the fact
>        $a cmp $b   # this has never worked is a bug: it should have
>    }               # explicitly downgraded strings (including replacing
>                    # characters >255).
>
> Probably the existing pragmas cannot be reused for this, so something
> like
>
>    use strings "unicode";
>    use strings "bytes";
>    use strings "locale";
>    use strings casefold =>
>        uc => sub { ... },
>        lc => sub { ... };
>
> would be necessary.
I get you. Show quoted text
>> So we would need >> >> bcmp : compare two strings at the binary level irrespective of how >> they are encoded. >> ucmp : compare two strings at the codepoint level >> iucmp : compare two strings case insensitively at the codepoint level >> using unicode semantics >> >> Is this what you mean?
> > Effectively, but without introducing a whole lot of new operator names.
Hmm, I think I'd be happier with the operators. But I can see what you mean. Show quoted text
>> > A properly-scoped lc override puts the onus on >> > the user to know whether their strings should be treated as Turkish or >> > not, just as they need to know whether they should be using == or eq.
>> >> This makes sense, but is problematic at the level that the holy grail >> is to make "strings just work". However I think maybe recognizing that >> one cannot make "strings just work" is a good thing...
> > +1. Sane, comprehensible semantics are more use than an attempt to DWIW > with bizarre side-effects.
Well put. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Ben Morrow <ben [...] morrow.me.uk>, public [...] khwilliamson.com, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 10:46:19 -0400
To: demerphq <demerphq [...] gmail.com>
From: Eric Brine <ikegami [...] adaelis.com>
Download (untitled) / with headers
text/plain 628b
On Wed, Jul 14, 2010 at 3:26 AM, demerphq <demerphq@gmail.com> wrote: Show quoted text
> Do we expose support for canonicalization? >
I think Unicode::Normalize uses Perl's tables, so I think so. Show quoted text
> So we would need > > bcmp : compare two strings at the binary level irrespective of how > they are encoded. > ucmp : compare two strings at the codepoint level > iucmp : compare two strings case insensitively at the codepoint level > using unicode semantics >
What about comparing grapheme equivalence? For example, e acute vs e + combining acute (Canonical equivalence) or superscript 2 vs plain 2 (Compatibility equivalence). - Eric Brine
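A minimal sketch of what those two equivalence checks look like via the core Unicode::Normalize module; the sample strings are only illustrative:

    use strict;
    use warnings;
    use Unicode::Normalize qw(NFD NFKD);

    # Canonical equivalence: precomposed e-acute vs. e + combining acute.
    my $precomposed = "\x{e9}";
    my $decomposed  = "e\x{301}";
    print "canonically equivalent\n" if NFD($precomposed) eq NFD($decomposed);

    # Compatibility equivalence: superscript two vs. plain 2.
    print "compatibility equivalent\n" if NFKD("\x{b2}") eq NFKD("2");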
CC: demerphq [...] gmail.com, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 10:23:43 -0600
To: Ben Morrow <ben [...] morrow.me.uk>
From: karl williamson <public [...] khwilliamson.com>
Ben Morrow wrote: Show quoted text
> Quoth demerphq@gmail.com (demerphq):
>> Fold case is basically the longest equivalent of a given character. >> >> So for instance the foldcase of \xDF is "ss", however arguably (the >> latest unicode version and apparently Austria might say differently) >> the uc of \xDF is \xDF and the lc of \xDF is also \xDF.
I don't understand what you're saying. All versions of Unicode say the upper case of \xDF is SS, and the titlecase is Ss, though it says the titlecase should never occur in practice. I didn't understand someone's explanation for that. Show quoted text
>> >> So basically the interface would have to support casing from a single >> character to multiple. Im assuming we already suport this for uc/lc.
We do. Show quoted text
>> >> I havent dug, and so forgive me if i ask a stupid question, do we >> support a distinct "titlecase" mode? Do we have primitives for it? Do >> we expose foldcasing? Do we expose support for canonicalization?
> > ucfirst and \u perform titlecasing for one character only (the rest of > the string is lowercased). AFAIK that's the only interface.
I know of no other interface. Here's what Unicode says about titlecase: "Because of the inclusion of certain composite characters for compatibility, such as U+01F1 latin capital letter dz, a third case, called titlecase, is used where the first character of a word must be capitalized. An example of such a character is U+01F2 latin capital letter d with small letter z. The three case forms are UPPERCASE, Titlecase, and lowercase.... titlecase distinctions apply only to a handful of compatibility characters." Note there aren't many titlecase characters. Foldcase is only exposed through //i. Canonicalization is only supported by Unicode::Normalize. Show quoted text
>
>>>> And context-sensitive casing is problematic anyway, >>>> because one never knows if one got sufficient context to make the right >>>> choice, or if concatenating or taking substrings will later make the >>>> right choice now, the wrong choice then. �In the case of Turkish, the >>>> context sensitivity applies only to modifier characters which should >>>> never be split apart from their base character anyway.
>>> I feel the important point to remember in this sort of discussion is >>> 'Perl has polymorphic values, but monomorphic operators, since having >>> both is too confusing'. IMHO most of the problems with Perl's Unicode >>> support can be traced back to forgetting this design decision, and >>> allowing the flags on the string to affect the operators (rather than >>> the other way around).
>> If i read you write you are suggesting we should have added new >> operators for string comparison? >> >> Im trying to work out this would work in practice. Lets suppose we >> only add one new operator for each operation, based on the cmp/<=> >> interface.
> > At the time it was considered that a lexical pragma to switch from one > version of cmp to another was the least bad solution. (I pretty-much > agree, here.) So > > $a cmp $b # coerces $a and $b to default semantics, whatever > # they may be (probably Unicode). > > { use bytes; # coerces $a and $b to byte semantics. IMHO the fact > $a cmp $b # this has never worked is a bug: it should have > } # explicitly downgraded strings (including replacing > # characters >255). > > Probably the existing pragmas cannot be reused for this, so something > like > > use strings "unicode"; > use strings "bytes"; > use strings "locale"; > use strings casefold => > uc => sub { ... }, > lc => sub { ... }; > > would be necessary. >
>> So we would need >> >> bcmp : compare two strings at the binary level irrespective of how >> they are encoded. >> ucmp : compare two strings at the codepoint level >> iucmp : compare two strings case insensitively at the codepoint level >> using unicode semantics >> >> Is this what you mean?
> > Effectively, but without introducing a whole lot of new operator names. >
>>> A properly-scoped lc override puts the onus on >>> the user to know whether their strings should be treated as Turkish or >>> not, just as they need to know whether they should be using == or eq.
>> This makes sense, but is problematic at the level that the holy grail >> is to make "strings just work". However I think maybe recognizing that >> one cannot make "strings just work" is a good thing...
> > +1. Sane, comprehensible semantics are more use than an attempt to DWIW > with bizarre side-effects. > > Ben >
CC: Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 10:39:06 -0600
To: demerphq <demerphq [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
demerphq wrote: Show quoted text
> On 14 July 2010 02:25, karl williamson <public@khwilliamson.com> wrote:
>> Ben Morrow wrote:
>>> Quoth public@khwilliamson.com (Karl Williamson): >>> Another point is that Yves said that m//i doesn't currently honour any >>> of these overrides. If we're going to improve case-folding (rather than >>> simply fixing bugs and optimising existing behaviour) it would certainly >>> be helpful to have some way to override this. I don't know enough about >>> what m//i does to know what would be required: would a fourth ToFold >>> override be sufficient?
>> My experience is that the Unicode case-folding is the most broken part of >> Unicode handling. I gave up trying to write TODO test scripts because there >> were just too many things wrong. And they're hard to fix, as the design is >> flawed. Yves has said that he had a plan to redo the whole thing as a trie, >> and I let him know already that this kind of thing might be an issue, and he >> responded that he would keep that in mind.
> > My thoughts concerned performance only. My thinking was that there was > a clever way to store the folding info so that we could improve > unicode case insensitive matching. I started working on it, discovered > a flaw in my plans which seemed like a show stopper, and I haven't > returned to it. Lexically scoped casing rules would definitely not > improve the performance issues. > > Mostly fold casing seems to make sense to me, except that it is > inefficient, can you refresh me, and the list as to the really > problematic cases? Note that our implementation has issues related to > how we model sequences of characters (for instance EXACT/ANYOF etc) > but the underlying premise of fold-casing doesn't seem to me to be > intrinsically broken.... >
I agree that the basic premise is ok, except for things like this:

    "\N{LATIN SMALL LIGATURE FF}" =~ /ff/i

is true. But what about

    "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i

and what are $1 and $2? If someone knows a language that has good Unicode support and an equivalent concept, I'd love to know what they do with this.

Basically the problems are with the latin1 non-ascii range, and multi-char folds. The optimizer rejects many things that it shouldn't, and there are problems when the target is a multi-char fold, like

    "\N{LATIN SMALL LIGATURE FF}" =~ /[a-f][f-z]/i

Here are some tickets:

71752   Case-insensitive matching of characters above 255 in ranges in character classes doesn't work
71736   Case insensitive regex matching where the fold is multi-char has many bugs
71734   regex match of Above Latin1 character with ASCII or Latin1 fold doesnt work if the fold is in a character class
71732   regex match of Above Latin1 character with ASCII fold doesnt work if the fold has a quantifier
71730   Many latin1 characters don't match case-insensitively correctly against a utf8 string
55250   utf-8 regex case insensitive character classes mishandle non-utf8 strings

I have never tested backreferences with regard to this. I imagine there are bugs there as well.
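As a concrete starting point, a minimal probe script for the ligature examples above; it only reports what the perl it runs on actually does rather than asserting an expected answer, since that is exactly what is unsettled:

    use strict;
    use warnings;
    use charnames ':full';

    my $lig = "\N{LATIN SMALL LIGATURE FF}";

    print "/ff/i match:        ", ($lig =~ /ff/i ? "yes" : "no"), "\n";

    if ($lig =~ /(f)(f)/i) {
        print "/(f)(f)/i matched;  \$1=<$1> \$2=<$2>\n";
    }
    else {
        print "/(f)(f)/i did not match\n";
    }

    print "/[a-f][f-z]/i:      ", ($lig =~ /[a-f][f-z]/i ? "yes" : "no"), "\n";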
CC: Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 18:49:39 +0200
To: karl williamson <public [...] khwilliamson.com>
From: demerphq <demerphq [...] gmail.com>
On 14 July 2010 18:23, karl williamson <public@khwilliamson.com> wrote: Show quoted text
> Ben Morrow wrote:
>> >> Quoth demerphq@gmail.com (demerphq):
>>> >>> Fold case is basically the longest equivalent of a given character. >>> >>> So for instance the foldcase of \xDF is "ss", however arguably (the >>> latest unicode version and apparently Austria might say differently) >>> the uc of \xDF is \xDF and the lc of \xDF is also \xDF.
> > I don't understand what you're saying.  All versions of Unicode say the > upper case of \xDF is SS, and the titlecase is Ss, though it says the > titlecase should never occur in practice.  I didn't understand someone's > explanation for that.
Ah, never mind. I was trying to make a point, but used a bad example. I was thinking of the practice that when producing signs or legal documents ß is not capitalized even if the rest of the word is. Also it is curious that the uppercase of U+00DF is SS and not the newly created "ẞ" U+1E9E. Anyway, my point was that the rules allow for these things to vary.
>>> So basically the interface would have to support casing from a single >>> character to multiple. Im assuming we already suport this for uc/lc.
> > We do. >
>>> >>> I havent dug, and so forgive me if i ask a stupid question, do we >>> support a distinct "titlecase" mode? Do we have primitives for it? Do >>> we expose foldcasing? Do we expose support for canonicalization?
>> >> ucfirst and \u perform titlecasing for one character only (the rest of >> the string is lowercased). AFAIK that's the only interface.
> > I know of no other interface.  Here's what Unicode says about titlecase: > > "Because of the inclusion of certain composite characters for compatibility, > such as U+01F1 latin capital letter dz, a third case, called titlecase, is > used where the first character of a word must be capitalized. An example of > such a character is U+01F2 latin capital letter d with small letter z. The > three case forms are UPPERCASE, Titlecase, and lowercase.... titlecase > distinctions apply only to a handful of compatibility characters." > > Note there aren't many titlecase characters. > > Foldcase is only exposed through //i.  Canonicalization is only supported by > Unicode::Normalize.
Thanks. -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 18:53:28 +0200
To: karl williamson <public [...] khwilliamson.com>
From: demerphq <demerphq [...] gmail.com>
On 14 July 2010 18:39, karl williamson <public@khwilliamson.com> wrote: Show quoted text
> demerphq wrote:
>> >> On 14 July 2010 02:25, karl williamson <public@khwilliamson.com> wrote:
>>> >>> Ben Morrow wrote:
>>>> >>>> Quoth public@khwilliamson.com (Karl Williamson): >>>> Another point is that Yves said that m//i doesn't currently honour any >>>> of these overrides. If we're going to improve case-folding (rather than >>>> simply fixing bugs and optimising existing behaviour) it would certainly >>>> be helpful to have some way to override this. I don't know enough about >>>> what m//i does to know what would be required: would a fourth ToFold >>>> override be sufficient?
>>> >>> My experience is that the Unicode case-folding is the most broken part of >>> Unicode handling.  I gave up trying to write TODO test scripts because >>> there >>> were just too many things wrong.  And they're hard to fix, as the design >>> is >>> flawed.  Yves has said that he had a plan to redo the whole thing as a >>> trie, >>> and I let him know already that this kind of thing might be an issue, and >>> he >>> responded that he would keep that in mind.
>> >> My thoughts concerned performance only. My thinking was that there was >> a clever way to store the folding info so that we could improve >> unicode case insensitive matching. I started working on it, discovered >> a flaw in my plans which seemed like a show stopper, and I haven't >> returned to it. Lexically scoped casing rules would definitely not >> improve the performance issues. >> >> Mostly fold casing seems to make sense to me, except that it is >> inefficient, can you refresh me, and the list as to the really >> problematic cases? Note that our implementation has issues related to >> how we model sequences of characters (for instance EXACT/ANYOF etc) >> but the underlying premise of fold-casing doesn't seem to me to be >> intrinsically broken.... >>
> I agree that the basic premise is ok, except for things like this: > > "\N{LATIN SMALL LIGATURE FF}" =~ /ff/i > is true.  But what about > "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i > and what are $1 and $2?  If someone knows a language that has good Unicode > support and an equivalent concept, I'd love to know what they do with this.
Aha. Nice example. So Unicode documents how it is supposed to match, but not how it is supposed to capture? Show quoted text
> Basically the problems are with the latin1 non-ascii range, and multi-char > folds.  The optimizer rejects many things that it shouldn't,
Ok, i think i know about some of these but not all of them. Show quoted text
> and there are > problems when the target is a multi-char fold, like > "\N{LATIN SMALL LIGATURE FF}" =~ /[a-f][f-z]/i > > Here's some tickets: > 71752   Case-insensitive matching of characters above 255 in ranges in > character classes doesn't work > > 71736   Case insensitive regex matching where the fold is multi-char has > many bugs > > 71734   regex match of Above Latin1 character with ASCII or Latin1 fold > doesnt work if the fold is in a character class > > 71732   regex match of Above Latin1 character with ASCII fold doesnt work if > the fold has a quantifier > > 71730   Many latin1 characters don't match case-insensitively correctly > against a utf8 string > > 55250   utf-8 regex case insensitive character classes mishandle non-utf8 > strings > > I have never tested backreferences with regard to this.  I imagine there are > bugs there as well.
Yeah, I bet. Thanks again, Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 10:59:15 -0600
To: demerphq <demerphq [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
demerphq wrote: Show quoted text
> On 14 July 2010 18:23, karl williamson <public@khwilliamson.com> wrote:
>> Ben Morrow wrote:
>>> Quoth demerphq@gmail.com (demerphq):
>>>> Fold case is basically the longest equivalent of a given character. >>>> >>>> So for instance the foldcase of \xDF is "ss", however arguably (the >>>> latest unicode version and apparently Austria might say differently) >>>> the uc of \xDF is \xDF and the lc of \xDF is also \xDF.
>> I don't understand what you're saying. All versions of Unicode say the >> upper case of \xDF is SS, and the titlecase is Ss, though it says the >> titlecase should never occur in practice. I didn't understand someone's >> explanation for that.
> > Ah, never mind. I was trying to make a point, but used a bad example. > > I was thinking of the practice that when producing signs or legal > documents ß is not capitalized even if the rest of the word is. > > Also it is curious that the uppercase of the U+DF is SS and not the > newly created "ẞ" U+1E9E.
They were trying to preserve backward compatibility. Also, the new character isn't in widespread use, AFAIK. I'm curious how often the 6.0 "PILE OF POO" emoji will be used. I can see it being thrown around in flame wars, or .signature files.
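A tiny sketch of the casing fact under discussion, with the string explicitly upgraded so that Unicode rules apply (sidestepping the single-byte-string behaviour this ticket is about):

    use strict;
    use warnings;

    # uc() maps U+00DF to "SS"; the newer U+1E9E (LATIN CAPITAL LETTER SHARP S)
    # is a distinct character and is not what uc() produces.
    my $sz = "\xdf";
    utf8::upgrade($sz);   # force Unicode semantics for this string
    print uc($sz), "\n";  # prints "SS"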
CC: Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 11:08:47 -0600
To: demerphq <demerphq [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
demerphq wrote: Show quoted text
> On 14 July 2010 18:39, karl williamson <public@khwilliamson.com> wrote:
>> demerphq wrote:
>>> On 14 July 2010 02:25, karl williamson <public@khwilliamson.com> wrote:
>>>> Ben Morrow wrote:
>>>>> Quoth public@khwilliamson.com (Karl Williamson): >>>>> Another point is that Yves said that m//i doesn't currently honour any >>>>> of these overrides. If we're going to improve case-folding (rather than >>>>> simply fixing bugs and optimising existing behaviour) it would certainly >>>>> be helpful to have some way to override this. I don't know enough about >>>>> what m//i does to know what would be required: would a fourth ToFold >>>>> override be sufficient?
>>>> My experience is that the Unicode case-folding is the most broken part of >>>> Unicode handling. I gave up trying to write TODO test scripts because >>>> there >>>> were just too many things wrong. And they're hard to fix, as the design >>>> is >>>> flawed. Yves has said that he had a plan to redo the whole thing as a >>>> trie, >>>> and I let him know already that this kind of thing might be an issue, and >>>> he >>>> responded that he would keep that in mind.
>>> My thoughts concerned performance only. My thinking was that there was >>> a clever way to store the folding info so that we could improve >>> unicode case insensitive matching. I started working on it, discovered >>> a flaw in my plans which seemed like a show stopper, and I haven't >>> returned to it. Lexically scoped casing rules would definitely not >>> improve the performance issues. >>> >>> Mostly fold casing seems to make sense to me, except that it is >>> inefficient, can you refresh me, and the list as to the really >>> problematic cases? Note that our implementation has issues related to >>> how we model sequences of characters (for instance EXACT/ANYOF etc) >>> but the underlying premise of fold-casing doesn't seem to me to be >>> intrinsically broken.... >>>
>> I agree that the basic premise is ok, except for things like this: >> >> "\N{LATIN SMALL LIGATURE FF}" =~ /ff/i >> is true. But what about >> "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i >> and what are $1 and $2? If someone knows a language that has good Unicode >> support and an equivalent concept, I'd love to know what they do with this.
> > Aha. Nice example. So Unicode documents how it is supposed to match, > but not how it is supposed to capture?
All they say is that two strings are case fold equivalent if their case-folds are equal. They don't address the issue of regexes. I don't think their regexes include capturing. Show quoted text
>
>> Basically the problems are with the latin1 non-ascii range, and multi-char >> folds. The optimizer rejects many things that it shouldn't,
> > Ok, i think i know about some of these but not all of them. >
>> and there are >> problems when the target is a multi-char fold, like >> "\N{LATIN SMALL LIGATURE FF}" =~ /[a-f][f-z]/i >> >> Here's some tickets: >> 71752 Case-insensitive matching of characters above 255 in ranges in >> character classes doesn't work >> >> 71736 Case insensitive regex matching where the fold is multi-char has >> many bugs >> >> 71734 regex match of Above Latin1 character with ASCII or Latin1 fold >> doesnt work if the fold is in a character class >> >> 71732 regex match of Above Latin1 character with ASCII fold doesnt work if >> the fold has a quantifier >> >> 71730 Many latin1 characters don't match case-insensitively correctly >> against a utf8 string >> >> 55250 utf-8 regex case insensitive character classes mishandle non-utf8 >> strings >> >> I have never tested backreferences with regard to this. I imagine there are >> bugs there as well.
> > Yeah, I bet. > > Thanks again, > Yves >
CC: Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 11:09:58 -0600
To: David Golden <xdaveg [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
David Golden wrote: Show quoted text
> On Mon, Jul 12, 2010 at 8:11 PM, Karl Williamson > <public@khwilliamson.com> wrote:
>>> If we're going to fix something that was poorly designed/considered in >>> the first way, let's see about going all the way and coming up with a >>> better design. (cue the caution not to let the "perfect" solution >>> stand in the way of a "good" solution) >>>
>> I agree, but wonder how much work is it going to be. I don't understand the >> implications very well of Ben's patch that just got applied. Does it help >> make this easier, along the lines he outlined in his first email?
> > Or let me make another suggestion -- move it to something like > CORE::GLOBAL::blah so at least people realize that they're mucking > with a global and have the option to localize it or something. > > -- David >
I'm afraid I don't understand your suggestion, probably because of my lack of background with many parts of Perl.
CC: "'Ben Morrow'" <ben [...] morrow.me.uk>, <perl5-porters [...] perl.org>
Subject: RE: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 10:34:53 -0700
To: "'karl williamson'" <public [...] khwilliamson.com>, "'demerphq'" <demerphq [...] gmail.com>
From: "Jan Dubois" <jand [...] activestate.com>
On Wed, 14 Jul 2010, karl williamson wrote: Show quoted text
> I agree that the basic premise is ok, except for things like this: > > "\N{LATIN SMALL LIGATURE FF}" =~ /ff/i > is true. But what about > "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i > and what are $1 and $2?
I'm afraid this might need another flag beyond /i: "\N{LATIN SMALL LIGATURE FF}" =~ /(.)/i Is $1 supposed to be "f" or "\N{LATIN SMALL LIGATURE FF}", and why? So you need some mechanism to specify if capture groups can capture parts of a folded character or not. If they cannot, then "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i cannot match. If they are allowed to capture split characters, then it does match and $1 and $2 should each be set to "f". Cheers, -Jan
CC: karl williamson <public [...] khwilliamson.com>, Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 19:39:59 +0200
To: Jan Dubois <jand [...] activestate.com>
From: demerphq <demerphq [...] gmail.com>
On 14 July 2010 19:34, Jan Dubois <jand@activestate.com> wrote: Show quoted text
> On Wed, 14 Jul 2010, karl williamson wrote:
>> I agree that the basic premise is ok, except for things like this: >> >> "\N{LATIN SMALL LIGATURE FF}" =~ /ff/i >> is true.  But what about >> "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i >> and what are $1 and $2?
> > I'm afraid this might need another flag beyond /i: > >    "\N{LATIN SMALL LIGATURE FF}" =~ /(.)/i > > Is $1 supposed to be "f" or "\N{LATIN SMALL LIGATURE FF}", and why? > > So you need some mechanism to specify if capture groups can capture > parts of a folded character or not.  If they cannot, then > >    "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i > > cannot match.  If they are allowed to capture split characters, > then it does match and $1 and $2 should each be set to "f".
I was thinking along similar lines, but when i got to the last conclusion i became ill. Consider that this means that a string of length 1, when matched against something like the above, may end up with the result that the length of "$1$2" is > 1. Which seems most odd.... Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 18:55:09 +0100
To: demerphq [...] gmail.com, perl5-porters [...] perl.org
From: Ben Morrow <ben [...] morrow.me.uk>
Quoth demerphq@gmail.com (demerphq): Show quoted text
> On 14 July 2010 13:37, Ben Morrow <ben@morrow.me.uk> wrote:
> > Quoth demerphq@gmail.com (demerphq):
> >> > >> Fold case is basically the longest equivalent of a given character. > >> > >> So for instance the foldcase of \xDF is "ss", however arguably (the > >> latest unicode version and apparently Austria might say differently) > >> the uc of \xDF is \xDF and the lc of \xDF is also \xDF. > >> > >> So basically the interface would have to support casing from a single > >> character to multiple. Im assuming we already suport this for uc/lc. > >> > >> I havent dug, and so forgive me if i ask a stupid question, do we > >> support a distinct "titlecase" mode? Do we have primitives for it? Do > >> we expose foldcasing? Do we expose support for canonicalization?
> > > > ucfirst and \u perform titlecasing for one character only (the rest of > > the string is lowercased). AFAIK that's the only interface.
> > Ah, but thats the thing, they dont perform /titlecasing/ they provide > "ucfirst", it just happens to be the case that in most western > European languages that the two are the same. In many scripts they > aren't.
I know, and ucfirst uses titlecase rather than uppercase. So

    my $ff = "\N{LATIN SMALL LIGATURE FF}";

    uc($ff)         eq "FF";
    ucfirst($ff)    eq "Ff";

    my $dz = "\N{LATIN SMALL LETTER DZ}";

    uc($dz)         eq "\N{LATIN CAPITAL LETTER DZ}";
    ucfirst($dz)    eq "\N{LATIN CAPITAL LETTER D WITH SMALL LETTER Z}";

(Incidentally, Karl appears to be right that titlecasing is only used for ligatures and other special cases. I had in the back of my mind that there were some Arabic scripts with whole alphabets of titlecase, but apparently not.)
> > Effectively, but without introducing a whole lot of new operator names.
> > Hmm, i think id be happier with the operators. But i can see what you mean.
Well, I'd be fine with that too. It'd need to be under 'feature', of course, which is perhaps why it wasn't done that way from the start. Ben
CC: "'karl williamson'" <public [...] khwilliamson.com>, "'Ben Morrow'" <ben [...] morrow.me.uk>, <perl5-porters [...] perl.org>
Subject: RE: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 10:58:46 -0700
To: "'demerphq'" <demerphq [...] gmail.com>
From: "Jan Dubois" <jand [...] activestate.com>
On Wed, 14 Jul 2010, demerphq wrote: Show quoted text
> On 14 July 2010 19:34, Jan Dubois <jand@activestate.com> wrote:
> > On Wed, 14 Jul 2010, karl williamson wrote:
> >> I agree that the basic premise is ok, except for things like this: > >> > >> "\N{LATIN SMALL LIGATURE FF}" =~ /ff/i > >> is true.  But what about > >> "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i > >> and what are $1 and $2?
> > > > I'm afraid this might need another flag beyond /i: > > > >    "\N{LATIN SMALL LIGATURE FF}" =~ /(.)/i > > > > Is $1 supposed to be "f" or "\N{LATIN SMALL LIGATURE FF}", and why? > > > > So you need some mechanism to specify if capture groups can capture > > parts of a folded character or not.  If they cannot, then > > > >    "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i > > > > cannot match.  If they are allowed to capture split characters, > > then it does match and $1 and $2 should each be set to "f".
> > I was thinking similar lines but when i got the last conclusion i became ill. > > Consider that this means that a string of length 1 when matched > asomething like above may end up with the result that "$1$2" is > 1. > > Which seems most odd....
Indeed. So maybe the correct "solution" is to say that capture groups will never capture split folded characters, which seems reasonable to me, but may just be a lack of my imagination. In that case we can say that /i only affects the matching and not the capturing, so after this expression: "\N{LATIN SMALL LIGATURE FF}" =~ /(ff)/i $1 is guaranteed to be "\N{LATIN SMALL LIGATURE FF}" and not "ff". Cheers, -Jan
CC: perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 20:04:50 +0200
To: Ben Morrow <ben [...] morrow.me.uk>
From: demerphq <demerphq [...] gmail.com>
On 14 July 2010 19:55, Ben Morrow <ben@morrow.me.uk> wrote: Show quoted text
> Quoth demerphq@gmail.com (demerphq):
>> On 14 July 2010 13:37, Ben Morrow <ben@morrow.me.uk> wrote:
>> > Quoth demerphq@gmail.com (demerphq):
>> >> >> >> Fold case is basically the longest equivalent of a given character. >> >> >> >> So for instance the foldcase of \xDF is "ss", however arguably (the >> >> latest unicode version and apparently Austria might say differently) >> >> the uc of \xDF is \xDF and the lc of \xDF is also \xDF. >> >> >> >> So basically the interface would have to support casing from a single >> >> character to multiple. Im assuming we already suport this for uc/lc. >> >> >> >> I havent dug, and so forgive me if i ask a stupid question, do we >> >> support a distinct "titlecase" mode? Do we have primitives for it? Do >> >> we expose foldcasing? Do we expose support for canonicalization?
>> > >> > ucfirst and \u perform titlecasing for one character only (the rest of >> > the string is lowercased). AFAIK that's the only interface.
>> >> Ah, but thats the thing, they dont perform /titlecasing/ they provide >> "ucfirst", it just happens to be the case that in most western >> European languages that the two are the same. In many scripts they >> aren't.
> > I know, and ucfirst uses titlecase rather than uppercase. So > >    my $ff = "\N{LATIN SMALL LIGATURE FF}"; > >    uc($ff)         eq "FF"; >    ucfirst($ff)    eq "Ff"; > >    my $dz = "\N{LATIN SMALL LETTER DZ}"; > >    uc($dz)         eq "\N{LATIN CAPITAL LETTER DZ}"; >    ucfirst($dz)    eq "\N{LATIN CAPTIAL LETTER D WITH SMALL LETTER Z}";
Doh. Dont i feel like a prat now. Thanks for tolerating me. :-) Show quoted text
> (Incidentally, Karl appeasr to be right that titlecasing is only used > for ligatures and other special cases. I had in the back of my mind that > there were some Arabic scripts with whole alphabets of titlecase, but > apparently not.)
Yeah me too. Show quoted text
>> > Effectively, but without introducing a whole lot of new operator names.
>> >> Hmm, i think id be happier with the operators. But i can see what you mean.
> > Well, I'd be fine with that too. It'd need to be under 'feature', of > course, which is perhaps why it wasn't done that way from the start.
Agreed. yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: karl williamson <public [...] khwilliamson.com>, Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 20:24:08 +0200
To: Jan Dubois <jand [...] activestate.com>
From: demerphq <demerphq [...] gmail.com>
On 14 July 2010 19:58, Jan Dubois <jand@activestate.com> wrote: Show quoted text
> On Wed, 14 Jul 2010, demerphq wrote:
>> On 14 July 2010 19:34, Jan Dubois <jand@activestate.com> wrote:
>> > On Wed, 14 Jul 2010, karl williamson wrote:
>> >> I agree that the basic premise is ok, except for things like this: >> >> >> >> "\N{LATIN SMALL LIGATURE FF}" =~ /ff/i >> >> is true.  But what about >> >> "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i >> >> and what are $1 and $2?
>> > >> > I'm afraid this might need another flag beyond /i: >> > >> >    "\N{LATIN SMALL LIGATURE FF}" =~ /(.)/i >> > >> > Is $1 supposed to be "f" or "\N{LATIN SMALL LIGATURE FF}", and why? >> > >> > So you need some mechanism to specify if capture groups can capture >> > parts of a folded character or not.  If they cannot, then >> > >> >    "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i >> > >> > cannot match.  If they are allowed to capture split characters, >> > then it does match and $1 and $2 should each be set to "f".
>> >> I was thinking similar lines but when i got the last conclusion i became ill. >> >> Consider that this means that a string of length 1 when matched >> asomething like above may end up with the result that "$1$2" is > 1. >> >> Which seems most odd....
> > Indeed.  So maybe the correct "solution" is to say that capture groups > will never capture split folded characters, which seems reasonable to me, > but may just be a lack of my imagination.
That seems to me to be trouble. Show quoted text
> In that case we can say that /i only affects the matching and not the > capturing, so after this expression: > >    "\N{LATIN SMALL LIGATURE FF}" =~ /(ff)/i > > $1 is guaranteed to be "\N{LATIN SMALL LIGATURE FF}" and not "ff".
That one matching seems obvious to me. The real question is this one:

    "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i

Should it match? If yes, what should $1 and $2 hold?

A) yes, $1=f, $2=f
B) yes, $1="\N{LATIN SMALL LIGATURE FF}", $2=""
C) yes, $1="\N{LATIN SMALL LIGATURE FF}", $2="\N{LATIN SMALL LIGATURE FF}"
D) no

I guess B is the least insane option, but it still poses problems. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: "'karl williamson'" <public [...] khwilliamson.com>, "'Ben Morrow'" <ben [...] morrow.me.uk>, <perl5-porters [...] perl.org>
Subject: RE: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 12:09:40 -0700
To: "'demerphq'" <demerphq [...] gmail.com>
From: "Jan Dubois" <jand [...] activestate.com>
On Wed, 14 Jul 2010, demerphq wrote: Show quoted text
> > Indeed.  So maybe the correct "solution" is to say that capture groups > > will never capture split folded characters, which seems reasonable to me, > > but may just be a lack of my imagination.
> > That seems to me to be trouble.
Can you elaborate? Can you come up with a non-degenerate use case for when splitting things up actually makes sense? And if you do, wouldn't it make more sense to have a Unicode::foldcase() function that would transform your string first, and then you run a normal match against that instead? Show quoted text
> > In that case we can say that /i only affects the matching and not the > > capturing, so after this expression: > > > >    "\N{LATIN SMALL LIGATURE FF}" =~ /(ff)/i > > > > $1 is guaranteed to be "\N{LATIN SMALL LIGATURE FF}" and not "ff".
> > That one matching seems obvious to me. The real question is this one: > > "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i > > Should it match? If yes, what should $1 and $2 hold? > > A) yes, $1=f, $2=f > B) yes, $1="\N{LATIN SMALL LIGATURE FF}", $2="" > C) yes, $1="\N{LATIN SMALL LIGATURE FF}", $2=""\N{LATIN SMALL LIGATURE FF}"" > D) no > > I guess B is the least insane option, but it still poses problems.
Sorry, (B) and (C) don't make any sense to me whatsoever because the content of $1 and $2 no longer corresponds to the actual pattern inside the capturing groups.

(A) does make a certain amount of sense, but has lots of problems. And from a practical point of view I think you rarely want this. [*]

(D) makes complete sense to me once I accept that case folding is for matching purposes only but never affects capturing. It has the huge advantage that the capture groups will never contain folded characters but always the original ones.

This is similar to how regular expressions cannot capture individual bytes from a multi-byte UTF-8 sequence into different capture groups without first transforming the string into a byte-string.

Cheers,
-Jan

[*] I know you already discounted (A) too, but in case you still prefer it over (D):

    "\N{LATIN SMALL LIGATURE FF}" =~ /(.+?)/i

would this capture "f" or "\N{LATIN SMALL LIGATURE FF}"?
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 20:59:02 +0100
To: jand [...] activestate.com, perl5-porters [...] perl.org
From: Ben Morrow <ben [...] morrow.me.uk>
Quoth jand@activestate.com ("Jan Dubois"): Show quoted text
> On Wed, 14 Jul 2010, demerphq wrote:
> > > > "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i > > > > Should it match? If yes, what should $1 and $2 hold? > > > > A) yes, $1=f, $2=f > > B) yes, $1="\N{LATIN SMALL LIGATURE FF}", $2="" > > C) yes, $1="\N{LATIN SMALL LIGATURE FF}", $2=""\N{LATIN SMALL LIGATURE FF}"" > > D) no > > > > I guess B is the least insane option, but it still poses problems.
> > Sorry, (B) and (C) don't make any sense to me whatsoever because the > content of $1 and $2 no longer correspond to the actual pattern inside > the capturing groups.
That's usual with /i: "F" =~ /(f)/i; # $1 eq "F" I don't, however, think either B or C are right in this case. In fact, I'm rather surprised to find "\N{ff}" =~ /ff/i at all: /i is about *case-folding*, not breaking up ligatures. However, if it does match I would expect A. Show quoted text
> (A) does make a certain amount of sense, but has > lots of problems. And from I practical point of view I think you rarely > want this. [*]
<snip> Show quoted text
> [*] I know you already discounted (A) too, but in case you still prefer it > over (D): > > "\N{LATIN SMALL LIGATURE FF}" =~ /(.+?)/i > > would this capture "f" or "\N{LATIN SMALL LIGATURE FF}"?
IMHO: "\N{ff}". However,

    "\N{ff}" =~ /(f+?)/i

or some equivalent that requires the lig to be split in order to match should capture "f". (This may not be possible with the current casefold implementation, of course.)

This further implies that I would expect

    "\N{ff}f" =~ /(ff)/i

to capture "ff" (two characters, no lig, the last character in the string isn't matched), even though

    "\N{ff}f" =~ /(..)/i

would capture "\N{ff}f" (and match the whole string).

Ben
CC: karl williamson <public [...] khwilliamson.com>, Ben Morrow <ben [...] morrow.me.uk>, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 22:04:39 +0200
To: Jan Dubois <jand [...] activestate.com>
From: demerphq <demerphq [...] gmail.com>
On 14 July 2010 21:09, Jan Dubois <jand@activestate.com> wrote: Show quoted text
> On Wed, 14 Jul 2010, demerphq wrote:
>> > Indeed.  So maybe the correct "solution" is to say that capture groups >> > will never capture split folded characters, which seems reasonable to me, >> > but may just be a lack of my imagination.
>> >> That seems to me to be trouble.
> > Can you elaborate?  Can you come up with a non-degenerate use case for > when splitting things up actually makes sense?
I didnt say and didnt mean to imply that we should split things up. I dont think we should be "splitting" a single character at all, except in terms of doing so to facilitate the pattern matching (not capturing).
>And if you do, wouldn't > it make more sense to have a Unicode::foldcase() function that would > transform your string first, and then you run a normal match against that > instead?
I was going to suggest that if we wanted A) below. Show quoted text
>
>> > In that case we can say that /i only affects the matching and not the >> > capturing, so after this expression: >> > >> >    "\N{LATIN SMALL LIGATURE FF}" =~ /(ff)/i >> > >> > $1 is guaranteed to be "\N{LATIN SMALL LIGATURE FF}" and not "ff".
>> >> That one matching seems obvious to me. The real question is this one: >> >> "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i >> >> Should it match? If yes, what should $1 and $2 hold? >> >> A) yes, $1=f, $2=f >> B) yes, $1="\N{LATIN SMALL LIGATURE FF}", $2="" >> C) yes, $1="\N{LATIN SMALL LIGATURE FF}", $2=""\N{LATIN SMALL LIGATURE FF}"" >> D) no >> >> I guess B is the least insane option, but it still poses problems.
> > Sorry, (B) and (C) don't make any sense to me whatsoever because the > content of $1 and $2 no longer correspond to the actual pattern inside > the capturing groups.
Doesnt it depend how you decompose things? I mean, i think that what you are saying is that you expect something like this to always work:

    $str=~/($pat1)($pat2)/i and $1=~/$pat1/i and $2=~/$pat2/i;

If so then it seems to me that C qualifies. I would expect that:

    "\N{LATIN SMALL LIGATURE FF}"=~/(f)/i

to result in $1 being "\N{LATIN SMALL LIGATURE FF}". Do you agree? If not can you explain what you mean by "correspond"?
> (A) does make a certain amount of sense, but has > lots of problems. And from I practical point of view I think you rarely > want this. [*]
See thats funny, this is the last thing id expect given that "F"=~/(f)/i results in $1 containing "F". If it contained "f" maybe you would have a point, but im not getting you. I think it would be extremely bizarre if matching resulted in $1 containing a character that wasnt in the original string at all. What's the point of capturing then? How would $1 and the pos offsets work? Etc...
> > (D) makes complete sense to me once I accept that case folding is for > matching purposes only but never affects capturing. It has the huge > advantage that the capture groups will never contain folded characters > but always the original ones.
D makes more sense to me than A, but it doesn't make /that/ much sense. I would expect that I can take any pattern and add capturing brackets to it arbitrarily but legally, and have it match unchanged. Do you think otherwise? It seems to me that if /(f)(f)/ doesnt match then nor should: "\N{LATIN SMALL LIGATURE FF}"=~/(ff)/i Does that make sense? Show quoted text
> This is similar to how regular expressions cannot capture individual > bytes from a multi-byte UTF-8 sequence into different capture groups > without first transforming the string into a byte-string.
Well not really, individual bytes of utf8 sequence are meant to be invisible to a programmer. It should be possible to change the internal representation of unicode strings in perl and have no code break. Show quoted text
> > Cheers, > -Jan > > [*] I know you already discounted (A) too, but in case you still prefer it > over (D): > >    "\N{LATIN SMALL LIGATURE FF}" =~ /(.+?)/i > > would this capture "f" or "\N{LATIN SMALL LIGATURE FF}"?
Definitely "\N{LATIN SMALL LIGATURE FF}". It seems to me we would /like/ to have the following be true:

1: if $s=~/(A)(B)/ then $1=~/(A)/ and $2=~/(B)/

2: if $s=~/(A)(B)/ and A cannot evaluate to false and B cannot evaluate to false then $1 and $2 should be true.

3: if $s=~/(A)(B)/ then $s=~/$1$2/ should be true.

This isnt an exhaustive list, but it seems to me that we already cannot have both 2 and 3 with case insensitive matching in unicode.

cheers,
Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
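Purely as an illustration, the criteria can be written down as a test sketch; check_invariants() and the sample inputs are hypothetical, and criterion 3 is checked against the captured text taken literally (\Q...\E) rather than re-interpolated as a pattern:

    use strict;
    use warnings;
    use charnames ':full';
    use Test::More;

    # Hypothetical helper encoding the criteria above for one target/pattern pair.
    sub check_invariants {
        my ($s, $A, $B) = @_;
        ok( $s =~ /($A)($B)/i, "criterion 2: '$A' then '$B' matches" ) or return;
        my ($one, $two) = ($1, $2);
        ok( $one =~ /$A/i,         "criterion 1: \$1 still matches '$A'" );
        ok( $two =~ /$B/i,         "criterion 1: \$2 still matches '$B'" );
        ok( $s =~ /\Q$one$two\E/i, "criterion 3: target matches the captured text" );
    }

    check_invariants( "Ff", "f", "f" );
    check_invariants( "\N{LATIN SMALL LIGATURE FF}", "f", "f" );
    done_testing();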
CC: jand [...] activestate.com, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 22:17:19 +0200
To: Ben Morrow <ben [...] morrow.me.uk>
From: demerphq <demerphq [...] gmail.com>
On 14 July 2010 21:59, Ben Morrow <ben@morrow.me.uk> wrote: Show quoted text
> Quoth jand@activestate.com ("Jan Dubois"):
>> On Wed, 14 Jul 2010, demerphq wrote:
>> > >> > "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i >> > >> > Should it match? If yes, what should $1 and $2 hold? >> > >> > A) yes, $1=f, $2=f >> > B) yes, $1="\N{LATIN SMALL LIGATURE FF}", $2="" >> > C) yes, $1="\N{LATIN SMALL LIGATURE FF}", $2=""\N{LATIN SMALL LIGATURE FF}"" >> > D) no >> > >> > I guess B is the least insane option, but it still poses problems.
>> >> Sorry, (B) and (C) don't make any sense to me whatsoever because the >> content of $1 and $2 no longer correspond to the actual pattern inside >> the capturing groups.
> > That's usual with /i: > >    "F" =~ /(f)/i;      # $1 eq "F" > > I don't, however, think either B or C are right in this case. In fact, > I'm rather surprised to find "\N{ff}" =~ /ff/i at all: /i is about > *case-folding*, not breaking up ligatures. However, if it does match I > would expect A.
But thats the whole idea of case folding. That you convert each character into its longest canonical equivalent. Dont forget what you said earlier: my $ff = "\N{LATIN SMALL LIGATURE FF}"; uc($ff) eq "FF"; ucfirst($ff) eq "Ff"; given that doesnt it make sense that case insensitively $ff=~/ff/i ? Show quoted text
>
>> (A) does make a certain amount of sense, but has >> lots of problems. And from I practical point of view I think you rarely >> want this. [*]
> <snip>
>> [*] I know you already discounted (A) too, but in case you still prefer it >> over (D): >> >>     "\N{LATIN SMALL LIGATURE FF}" =~ /(.+?)/i >> >> would this capture "f" or "\N{LATIN SMALL LIGATURE FF}"?
> > IMHO: "\N{ff}". However, > >    "\N{ff}" =~ /(f+?)/i > > or some equivalent that requires the lig to be split in order to match > should capture "f". (This may not be possible with the current casefild > implementation, of course.)
Well, it would require completely changing how capture buffers work. However, it seems to me this isnt consistent with how ucfirst() and lcfirst() work, and how you expected them to work.
> > This further implies that I would expect > >    "\N{ff}f" =~ /(ff)/i > > to capture "ff" (two characters, no lig, the last character in the > string isn't matched), even though > >    "\N{ff}f" =~ /(..)/i > > would capture "\N{ff}f" (and match the whole string).
As i said before this would definitely be the most bizarre result for me. I would be extremely surprised if a capture buffer ended up containing a character that was not in the matched string at all. Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: karl williamson <public [...] khwilliamson.com>, Ben Morrow <ben [...] morrow.me.uk>, Jan Dubois <jand [...] activestate.com>
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 22:23:28 +0200
To: Perl5 Porteros <perl5-porters [...] perl.org>
From: demerphq <demerphq [...] gmail.com>
On 14 July 2010 22:04, demerphq <demerphq@gmail.com> wrote: Show quoted text
> It seems to me we would /like/ to have the following be true: > > 1: if $s=~/(A)(B)/  then $1=~/(A)/ and $2=~/(B)/ > > 2: if $s=~/(A)(B)/ and A cannot evaluate to false and B cannot > evaluate to false then $1 and $2 should be true. > > 3: if $s=~/(A)(B)/ then $s=~/$1$2/ should be true. > > This isnt an exhaustive list, but it seems to me that we already > cannot have both 2 and 3 with case insensitive matching in unicode. >
FWIW, i think that this is the right approach to solving this problem. We should enumerate all the things we expect to hold true when matching and then pick the option that satisfies the most criteria. Can anybody think of more criteria? Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"
CC: jand [...] activestate.com, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 16:26:49 -0400
To: Ben Morrow <ben [...] morrow.me.uk>
From: Eric Brine <ikegami [...] adaelis.com>
On Wed, Jul 14, 2010 at 3:59 PM, Ben Morrow <ben@morrow.me.uk> wrote: Show quoted text
> > Sorry, (B) and (C) don't make any sense to me whatsoever because the > > content of $1 and $2 no longer correspond to the actual pattern inside > > the capturing groups.
>
Show quoted text
>
That's usual with /i: Show quoted text
> > "F" =~ /(f)/i; # $1 eq "F" >
Your example doesn't show what it claims to. In your example, $1 matches the pattern inside the capturing group. With (B) and (C), $1 and $2 don't match the pattern inside the capturing group. Your example simply shows that $1 doesn't contain the pattern, but no one said it does or should. > I don't, however, think either B or C are right in this case. In fact,
> I'm rather surprised to find "\N{ff}" =~ /ff/i at all: /i is about > *case-folding*, not breaking up ligatures. However, if it does match I > would expect A. >
Case folding apparently does include flattening ligatures. > This further implies that I would expect
> > "\N{ff}f" =~ /(ff)/i > > to capture "ff" (two characters, no lig, the last character in the > string isn't matched)
I don't like the idea of capturing something that doesn't exist. I'd rather see

    "F" =~ /(f)/i        # $1 = F
    "F" =~ /(.)/i        # $1 = F
    "\N{ff}" =~ /(ff)/i  # $1 = \N{ff}
    "\N{ff}" =~ /(.)/i   # $1 = \N{ff}

than

    "F" =~ /(f)/i        # $1 = f
    "F" =~ /(.)/i        # $1 = F
    "\N{ff}" =~ /(ff)/i  # $1 = ff
    "\N{ff}" =~ /(.)/i   # $1 = \N{ff}

- Eric
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 21:41:24 +0100
To: demerphq [...] gmail.com, perl5-porters [...] perl.org
From: Ben Morrow <ben [...] morrow.me.uk>
Quoth demerphq@gmail.com (demerphq): Show quoted text
> On 14 July 2010 21:59, Ben Morrow <ben@morrow.me.uk> wrote:
> > > > I don't, however, think either B or C are right in this case. In fact, > > I'm rather surprised to find "\N{ff}" =~ /ff/i at all: /i is about > > *case-folding*, not breaking up ligatures. However, if it does match I > > would expect A.
> > But thats the whole idea of case folding. That you convert each > character into its longest canonical equivalent. > > Dont forget what you said earlier: > > my $ff = "\N{LATIN SMALL LIGATURE FF}"; > uc($ff) eq "FF"; > ucfirst($ff) eq "Ff"; > > given that doesnt it make sense that case insensitively $ff=~/ff/i ?
Not to me, no. There is no case-change on "\N{ff}" which gives "ff", and no case-change on "ff" which gives "\N{ff}". I *would* expect /FF/i to match; I'm not sure what I would expect /(F)(F)/i to capture. Probably ("f", "f"), though I might be more inclined to throw a warning and fail the match. I don't know. This whole issue of case-changes which change the number of characters is making my head hurt :). I think I would want to see some real-world examples, to see what results made the most sense. Ben
CC: "'karl williamson'" <public [...] khwilliamson.com>, "'Ben Morrow'" <ben [...] morrow.me.uk>, <perl5-porters [...] perl.org>
Subject: RE: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 13:41:37 -0700
To: "'demerphq'" <demerphq [...] gmail.com>
From: "Jan Dubois" <jand [...] activestate.com>
On Wed, 14 Jul 2010, demerphq wrote: Show quoted text
> On 14 July 2010 21:09, Jan Dubois <jand@activestate.com> wrote:
> >> "\N{LATIN SMALL LIGATURE FF}" =~ /(f)(f)/i > >> > >> Should it match? If yes, what should $1 and $2 hold? > >> > >> A) yes, $1=f, $2=f > >> B) yes, $1="\N{LATIN SMALL LIGATURE FF}", $2="" > >> C) yes, $1="\N{LATIN SMALL LIGATURE FF}", $2=""\N{LATIN SMALL LIGATURE FF}"" > >> D) no > >> > >> I guess B is the least insane option, but it still poses problems.
> > > > Sorry, (B) and (C) don't make any sense to me whatsoever because the > > content of $1 and $2 no longer correspond to the actual pattern inside > > the capturing groups.
> > Doesnt it depend how you decompose things? > > I mean, i think that what you are saying is that you expect something > like this to always work: > > $str=~/($pat1)($pat2)/i and $1=~/$pat1/i and $2=~/$pat2/i;
Almost, I expect this to be true: $str=~/($pat1)($pat2)/i and $1=~/\A$pat1\z/i and $2=~/\A$pat2\z/i; I don't expect anything in a capture group that wasn't matched by the capture expression. Show quoted text
> If so then it seems to be that C qualifies.
Not in my more strict expectation. :) Cheers, -Jan
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 23:21:25 +0100
To: ikegami [...] adaelis.com, perl5-porters [...] perl.org
From: Ben Morrow <ben [...] morrow.me.uk>
Quoth ikegami@adaelis.com (Eric Brine): Show quoted text
> On Wed, Jul 14, 2010 at 3:59 PM, Ben Morrow <ben@morrow.me.uk> wrote: >
[ Jan wrote: ] Show quoted text
> > > Sorry, (B) and (C) don't make any sense to me whatsoever because the > > > content of $1 and $2 no longer correspond to the actual pattern inside > > > the capturing groups.
>
> >
> That's usual with /i:
> > > > "F" =~ /(f)/i; # $1 eq "F" > >
> > Your example doesn't show what it claims to. In your example, $1 matches the > pattern inside the capturing group. With (B) and (C), $1 and $2 don't match > the pattern inside the capturing group. Your example simply shows that $1 > doesn't contain the pattern, but noone said it does or should.
You're right. Show quoted text
> I don't, however, think either B or C are right in this case. In fact,
> > I'm rather surprised to find "\N{ff}" =~ /ff/i at all: /i is about > > *case-folding*, not breaking up ligatures. However, if it does match I > > would expect A. > >
> > Case folding apparently does including flattening ligatures.
Hmmm. I suppose if we're supposed to be following Unicode, and Unicode mandates insane things, we're obliged to do insane things as well. Show quoted text
> This further implies that I would expect
> > > > "\N{ff}f" =~ /(ff)/i > > > > to capture "ff" (two characters, no lig, the last character in the > > string isn't matched)
> > > I don't like the idea of capturing something that doesn't exist. > > I'd rather see > > "F" =~ /(f)/i # $1 = F > "F" =~ /(.)/i # $1 = F > "\N{ff}" =~ /(ff)/i # $1 = \N{ff} > "\N{ff}" =~ /(.)/i # $1 = \N{ff}
What about "\N{ff}" =~ /([a-z][a-z])/i Would you really expect to get a single-character result from that? A slightly-unrelated issue: this is potentially a security problem. Anyone validating input with /[a-z]/i is *not* going to expect ligatures to be let through. Yves may be right, that @+ and pos and so on require the captures to be strict substrings of the matched string, in which case there is no choice but for a capture to take either the whole lig or none of it. Show quoted text
> than > > "F" =~ /(f)/i # $1 = f
I was never proposing this. My rule was along the lines of 'if it is necessary to break up a lig to get it to match, the captures are taken from the broken-up lig'. Ben
CC: perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 20:37:06 -0400
To: Ben Morrow <ben [...] morrow.me.uk>
From: Eric Brine <ikegami [...] adaelis.com>
On Wed, Jul 14, 2010 at 6:21 PM, Ben Morrow <ben@morrow.me.uk> wrote: Show quoted text
> What about > > "\N{ff}" =~ /([a-z][a-z])/i > > Would you really expect to get a single-character result from that? >
Correct. As long as we do ligature expansion for case folding, I don't see an alternative. I wouldn't want the act of extracting a word from a string to change the word as a result. Where do you draw the line?

    my ($m1) = "Œuf" =~ /(\w+)/;
    my ($m2) = "Œuf" =~ /(\w+)/i;
    my ($m3) = "Œuf" =~ /([a-z]+)/i;
    my ($m4) = "Œuf" =~ /([a-z]{4})/i;
    my ($m5) = "Œuf" =~ /([a-z][a-z][a-z][a-z])/i;

Or how about

    ( my $s1 = "Œ!uf" ) =~ s/(\w+)!(\w+)/$1$2/;
    ( my $s2 = "Œ!uf" ) =~ s/(\w+)!(\w+)/$1$2/i;
    ( my $s3 = "Œ!uf" ) =~ s/([a-z]+)!([a-z]+)/$1$2/;
    ( my $s4 = "Œ!uf" ) =~ s/([a-z]{2})!([a-z]{2})/$1$2/;
    ( my $s5 = "Œ!uf" ) =~ s/([a-z][a-z])!([a-z][a-z])/$1$2/;

So as far as I'm concerned, your question is really "should we expand ligatures for case folding"? I don't know the answer to that, or what the default should be if we let the user decide.

> A slightly-unrelated issue: this is potentially a security problem.
> Anyone validating input with /[a-z]/i is *not* going to expect ligatures
> to be let through.

Note that we already have something similar with /\w/.
CC: perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 20:48:24 -0400
To: Ben Morrow <ben [...] morrow.me.uk>
From: Eric Brine <ikegami [...] adaelis.com>
On Wed, Jul 14, 2010 at 8:37 PM, Eric Brine <ikegami@adaelis.com> wrote: Show quoted text
> So as far as I'm concerned, your question is really "should we expand > ligatures for case folding"?
Or more specifically, "should we use simple case folding or full case folding". The former folds each character into only one character.
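For illustration, the difference can be seen with the fc() builtin, which arrived later (perl 5.16, so it postdates this thread) and implements full case folding; under simple folding both of these characters would fold to themselves:

    use v5.16;               # fc(), say() and unicode_strings; newer than this thread
    use charnames ':full';

    # Full case folding expands these single characters to multi-character folds.
    say fc("\x{df}");                        # prints "ss"
    say fc("\N{LATIN SMALL LIGATURE FF}");   # prints "ff"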
CC: <perl5-porters [...] perl.org>
Subject: RE: PATCH [perl #58182] partial: user-defined casing
Date: Wed, 14 Jul 2010 18:21:58 -0700
To: "'Eric Brine'" <ikegami [...] adaelis.com>, "'Ben Morrow'" <ben [...] morrow.me.uk>
From: "Jan Dubois" <jand [...] activestate.com>
Download (untitled) / with headers
text/plain 793b
On Wed, 14 Jul 2010, Eric Brine wrote:
> > So as far as I'm concerned, your question is really "should we expand
> > ligatures for case folding"?
>
> Or more specifically, "should we use simple case folding or full case
> folding". The former folds each character into only one character.
Yes, that is a very good point. Especially since the "Unicode Default
Caseless Matching Algorithm" (section 3.13, "Default Case Algorithms")
recommends: "Caseless matching should also use normalization, ..."

Therefore I wouldn't mind punting on the whole issue and restricting
caseless matches in the regex engine to simple case folding only. The
Unicode::Casefolding logic could then be implemented as a CPAN module to
allow combining case folding with normalization as required.

Cheers,
-Jan
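A sketch of what a combined normalize-then-fold comparison could look like at the user level, using the core Unicode::Normalize module and fc() (perl 5.16+); the helper name caseless_eq is made up for illustration:

    use strict; use warnings;
    use feature 'fc';                  # perl 5.16+
    use Unicode::Normalize qw(NFD);    # core module

    # Roughly the Unicode Default Caseless Matching recipe:
    # X and Y match if NFD(fc(NFD(X))) eq NFD(fc(NFD(Y)))
    sub caseless_eq {
        my ($x, $y) = @_;
        return NFD(fc(NFD($x))) eq NFD(fc(NFD($y)));
    }

    # E WITH ACUTE as one code point vs. "e" plus COMBINING ACUTE ACCENT
    print caseless_eq("\x{C9}", "e\x{301}") ? "equal\n" : "not equal\n";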
CC: ikegami [...] adaelis.com, perl5-porters [...] perl.org
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Thu, 15 Jul 2010 09:09:03 +0200
To: Ben Morrow <ben [...] morrow.me.uk>
From: demerphq <demerphq [...] gmail.com>
Download (untitled) / with headers
text/plain 511b
On 15 July 2010 00:21, Ben Morrow <ben@morrow.me.uk> wrote:
> A slightly-unrelated issue: this is potentially a security problem.
> Anyone validating input with /[a-z]/i is *not* going to expect ligatures
> to be let through.
Oh, this boat sailed a long time ago with /\A\d+\z/, which of course
matches about 200 codepoints, and not 10. Personally I consider this a
lot more dangerous in terms of security than /[a-z]/i letting through
ligatures.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
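A sketch of that point about \d; the ARABIC-INDIC digits are just one block of the many Nd code points it accepts (the /a workaround shown in the last comment only exists from perl 5.14 on):

    use strict; use warnings;

    my $input = "\x{661}\x{662}\x{663}";    # ARABIC-INDIC DIGITS ONE, TWO, THREE

    # Passes a naive "digits only" check...
    print "passes /\\A\\d+\\z/\n"   if $input =~ /\A\d+\z/;

    # ...but is not what most validation code has in mind.
    print "passes /\\A[0-9]+\\z/\n" if $input =~ /\A[0-9]+\z/;    # never prints

    # perl 5.14+ only: /a restricts \d (and \w, \s) to ASCII
    # print "ASCII digits only\n"   if $input =~ /\A\d+\z/a;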
Subject: Re: PATCH [perl #58182] partial: user-defined casing
Date: Thu, 15 Jul 2010 11:16:46 +0200
To: perl5-porters [...] perl.org
From: "Dr.Ruud" <rvtol+usenet [...] isolution.nl>
Download (untitled) / with headers
text/plain 244b
Ben Morrow wrote:
> What about
>
>     "\N{ff}" =~ /([a-z][a-z])/i
>
> Would you really expect to get a single-character result from that?
That could depend on the active normalization form.
http://unicode.org/reports/tr15/

--
Ruud
CC: perl5-porters [...] perl.org
Subject: PATCH [perl #58182] partial: Unicode semantics for \s, \w, \b
Date: Sun, 15 Aug 2010 13:35:33 -0600
To: demerphq <demerphq [...] gmail.com>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 983b
make regen needed

The attached series of commits extends the control of feature
unicode_strings to cover regex matching of the sequences \b, \s, \w, and
their complements. Various versions of this have been submitted before,
and found wanting, with more ground-work needed. Hopefully I've laid
that properly for this one.

This patch does not include the new regex modifiers; that one is in
process, and will be submitted soon. I'm waiting for the discussion to
die down about whether it should be an error to ask for the default set
of modifiers and also list one of them. But the regex modifiers patch is
mostly complete. I'm submitting this now to allow time for it to be
picked apart before the full implementation comes.

Note that what is missing so far is the retention of the unicode
semantics interpretation when a compiled regex is interpolated into a
larger one.

The patch is also available at git://github.com/khwilliamson/perl.git,
branch regex_mods.
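For reference, a small script sketching the behaviour this series is after, using the non-breaking space example from the original report; the 'unicode_strings' result assumes a perl with these commits (or a later release where the feature covers the character classes):

    use strict; use warnings;

    my $nbsp = "\xA0";    # NO-BREAK SPACE, stored as a single byte

    {
        # Old default: \s applies ASCII-only rules to a byte string
        print "no feature:      ", $nbsp =~ /\s/ ? "matches" : "no match", "\n";
    }

    {
        use feature 'unicode_strings';
        # With the feature plus this series, \s should use Unicode
        # (Latin-1) semantics, so U+00A0 counts as whitespace.
        print "unicode_strings: ", $nbsp =~ /\s/ ? "matches" : "no match", "\n";
    }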
From 39a3e1025ef1d269e55b86797613eed8429bd5e5 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Sat, 14 Aug 2010 09:47:50 -0600 Subject: [PATCH] handy.h: Add isSPACEU() with Unicode semantics --- handy.h | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/handy.h b/handy.h index d6205c5..8e8f7bb 100644 --- a/handy.h +++ b/handy.h @@ -478,6 +478,8 @@ patched there. The file as of this writing is cpan/Devel-PPPort/parts/inc/misc #define isALPHA(c) (isUPPER(c) || isLOWER(c)) /* ALPHAU includes Unicode semantics for latin1 characters. It has an extra * >= AA test to speed up ASCII-only tests at the expense of the others */ +/* XXX decide whether to document the ALPHAU, ALNUMU and isSPACEU functions. + * Most of these should be implemented as table lookup for speed */ #define isALPHAU(c) (isALPHA(c) || (NATIVE_TO_UNI((U8) c) >= 0xAA \ && ((NATIVE_TO_UNI((U8) c) >= 0xC0 \ && NATIVE_TO_UNI((U8) c) != 0xD7 && NATIVE_TO_UNI((U8) c) != 0xF7) \ @@ -490,6 +492,8 @@ patched there. The file as of this writing is cpan/Devel-PPPort/parts/inc/misc #define isCHARNAME_CONT(c) (isALNUMU(c) || (c) == ' ' || (c) == '-' || (c) == '(' || (c) == ')' || (c) == ':' || NATIVE_TO_UNI((U8) c) == 0xA0) #define isSPACE(c) \ ((c) == ' ' || (c) == '\t' || (c) == '\n' || (c) =='\r' || (c) == '\f') +#define isSPACEU(c) (isSPACE(c) \ + || (NATIVE_TO_UNI(c) == 0x85 || NATIVE_TO_UNI(c) == 0xA0)) #define isPSXSPC(c) (isSPACE(c) || (c) == '\v') #define isBLANK(c) ((c) == ' ' || (c) == '\t') #define isDIGIT(c) ((c) >= '0' && (c) <= '9') -- 1.5.6.3
From abce67bf1f0f87a1603c9a013151d3d6349e7963 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Sat, 14 Aug 2010 14:22:01 -0600 Subject: [PATCH] perlrebackslash: Fix grammatical error --- pod/perlrebackslash.pod | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod index d460f7f..91c4d7d 100644 --- a/pod/perlrebackslash.pod +++ b/pod/perlrebackslash.pod @@ -475,7 +475,7 @@ backslash sequences. =item \A C<\A> only matches at the beginning of the string. If the C</m> modifier -isn't used, then C</\A/> is equivalent with C</^/>. However, if the C</m> +isn't used, then C</\A/> is equivalent to C</^/>. However, if the C</m> modifier is used, then C</^/> matches internal newlines, but the meaning of C</\A/> isn't changed by the C</m> modifier. C<\A> matches at the beginning of the string regardless whether the C</m> modifier is used. @@ -483,7 +483,7 @@ of the string regardless whether the C</m> modifier is used. =item \z, \Z C<\z> and C<\Z> match at the end of the string. If the C</m> modifier isn't -used, then C</\Z/> is equivalent with C</$/>, that is, it matches at the +used, then C</\Z/> is equivalent to C</$/>, that is, it matches at the end of the string, or before the newline at the end of the string. If the C</m> modifier is used, then C</$/> matches at internal newlines, but the meaning of C</\Z/> isn't changed by the C</m> modifier. C<\Z> matches at -- 1.5.6.3
From 15a3a096b99e5edaa7c4aa1fdba74e75113209f2 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Sat, 14 Aug 2010 15:49:36 -0600 Subject: [PATCH] regcomp.c: rmv trail blanks; uncuddle else --- regcomp.c | 23 ++++++++++------------- 1 files changed, 10 insertions(+), 13 deletions(-) diff --git a/regcomp.c b/regcomp.c index 43b881d..04fc038 100644 --- a/regcomp.c +++ b/regcomp.c @@ -870,7 +870,7 @@ S_cl_or(const RExC_state_t *pRExC_state, struct regnode_charclass_class *cl, con Dumps the final compressed table form of the trie to Perl_debug_log. Used for debugging make_trie(). */ - + STATIC void S_dump_trie(pTHX_ const struct _reg_trie_data *trie, HV *widecharmap, AV *revcharmap, U32 depth) @@ -3196,7 +3196,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, /* These are the cases when once a subexpression fails at a particular position, it cannot succeed even after backtracking at the enclosing scope. - + XXXX what if minimal match and we are at the initial run of {n,m}? */ if ((mincount != maxcount - 1) && (maxcount != REG_INFTY)) @@ -3337,7 +3337,6 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, #if 0 while ( nxt1 && (OP(nxt1) != WHILEM)) { regnode *nnxt = regnext(nxt1); - if (nnxt == nxt) { if (reg_off_by_arg[OP(nxt1)]) ARG_SET(nxt1, nxt2 - nxt1); @@ -3404,7 +3403,6 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, if (UTF) old = utf8_hop((U8*)s, old) - (U8*)s; - l -= old; /* Get the added string: */ last_str = newSVpvn_utf8(s + old, l, UTF); @@ -3492,13 +3490,13 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, if (flags & SCF_DO_STCLASS_AND) { for (value = 0; value < 256; value++) if (!is_VERTWS_cp(value)) - ANYOF_BITMAP_CLEAR(data->start_class, value); - } - else { + ANYOF_BITMAP_CLEAR(data->start_class, value); + } + else { for (value = 0; value < 256; value++) if (is_VERTWS_cp(value)) - ANYOF_BITMAP_SET(data->start_class, value); - } + ANYOF_BITMAP_SET(data->start_class, value); + } if (flags & SCF_DO_STCLASS_OR) cl_and(data->start_class, and_withp); flags &= ~SCF_DO_STCLASS; @@ -3511,7 +3509,6 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, data->pos_delta += 1; data->longest = &(data->longest_float); } - } else if (OP(scan) == FOLDCHAR) { int d = ARG(scan)==0xDF ? 1 : 2; @@ -3609,7 +3606,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, else { for (value = 0; value < 256; value++) if (!isALNUM(value)) - ANYOF_BITMAP_SET(data->start_class, value); + ANYOF_BITMAP_SET(data->start_class, value); } } break; @@ -3698,7 +3695,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, else { for (value = 0; value < 256; value++) if (isDIGIT(value)) - ANYOF_BITMAP_SET(data->start_class, value); + ANYOF_BITMAP_SET(data->start_class, value); } } break; @@ -3715,7 +3712,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, else { for (value = 0; value < 256; value++) if (!isDIGIT(value)) - ANYOF_BITMAP_SET(data->start_class, value); + ANYOF_BITMAP_SET(data->start_class, value); } } break; -- 1.5.6.3
From 1244640261b0259171fc88efd4c79187bde6004d Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Sat, 14 Aug 2010 18:01:22 -0600 Subject: [PATCH] regexec.c: add and refactor macros Add macros that will have unicode semantics; these share much code in common with ones that don't. So factor out that common code. These might be good candidates for inline functions when they are settled on. --- regexec.c | 32 +++++++++++++++++++++++++++----- 1 files changed, 27 insertions(+), 5 deletions(-) diff --git a/regexec.c b/regexec.c index ef55635..6d36a9c 100644 --- a/regexec.c +++ b/regexec.c @@ -176,11 +176,11 @@ #endif -#define CCC_TRY_AFF(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC,LCFUNC) \ - case NAMEL: \ +#define _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ + case NAMEL: \ PL_reg_flags |= RF_tainted; \ /* FALL THROUGH */ \ - case NAME: \ + case NAME: \ if (!nextchr) \ sayNO; \ if (utf8_target && UTF8_IS_CONTINUED(nextchr)) { \ @@ -202,12 +202,25 @@ nextchr = UCHARAT(locinput); \ break; \ } \ + /* Finished up by calling macro */ + +#define CCC_TRY_AFF(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC,LCFUNC) \ + _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ if (!(OP(scan) == NAME ? FUNC(nextchr) : LCFUNC(nextchr))) \ sayNO; \ nextchr = UCHARAT(++locinput); \ break -#define CCC_TRY_NEG(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC,LCFUNC) \ +/* Almost identical to the above, but has a case for a node that matches chars + * between 128 and 255 using Unicode (latin1) semantics. */ +#define CCC_TRY_AFF_U(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNCU,LCFUNC) \ + _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ + if (!(OP(scan) == NAMEL ? LCFUNC(nextchr) : (FUNCU(nextchr) && (isASCII(nextchr) || (FLAGS(scan) & USE_UNI))))) \ + sayNO; \ + nextchr = UCHARAT(++locinput); \ + break + +#define _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ case NAMEL: \ PL_reg_flags |= RF_tainted; \ /* FALL THROUGH */ \ @@ -232,13 +245,22 @@ locinput += PL_utf8skip[nextchr]; \ nextchr = UCHARAT(locinput); \ break; \ - } \ + } + +#define CCC_TRY_NEG(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC,LCFUNC) \ + _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ if ((OP(scan) == NAME ? FUNC(nextchr) : LCFUNC(nextchr))) \ sayNO; \ nextchr = UCHARAT(++locinput); \ break +#define CCC_TRY_NEG_U(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNCU,LCFUNC) \ + _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNCU) \ + if ((OP(scan) == NAMEL ? LCFUNC(nextchr) : (FUNCU(nextchr) && (isASCII(nextchr) || (FLAGS(scan) & USE_UNI))))) \ + sayNO; \ + nextchr = UCHARAT(++locinput); \ + break -- 1.5.6.3
From c98b72d84fe63405d164ba19d612249bdca3ce27 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Sat, 14 Aug 2010 18:05:53 -0600 Subject: [PATCH] regexec.c: make macro lines fit in 80 cols Certain multi-line macros had their continuation backslashes way out. One line of each is longer than 80 chars, but no point in makeing all the lines that long. --- regexec.c | 136 ++++++++++++++++++++++++++++++------------------------------ 1 files changed, 68 insertions(+), 68 deletions(-) diff --git a/regexec.c b/regexec.c index 6d36a9c..7bdae3f 100644 --- a/regexec.c +++ b/regexec.c @@ -176,90 +176,90 @@ #endif -#define _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ - case NAMEL: \ - PL_reg_flags |= RF_tainted; \ - /* FALL THROUGH */ \ - case NAME: \ - if (!nextchr) \ - sayNO; \ - if (utf8_target && UTF8_IS_CONTINUED(nextchr)) { \ - if (!CAT2(PL_utf8_,CLASS)) { \ - bool ok; \ - ENTER; \ - save_re_context(); \ - ok=CAT2(is_utf8_,CLASS)((const U8*)STR); \ - assert(ok); \ - LEAVE; \ - } \ - if (!(OP(scan) == NAME \ +#define _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ + case NAMEL: \ + PL_reg_flags |= RF_tainted; \ + /* FALL THROUGH */ \ + case NAME: \ + if (!nextchr) \ + sayNO; \ + if (utf8_target && UTF8_IS_CONTINUED(nextchr)) { \ + if (!CAT2(PL_utf8_,CLASS)) { \ + bool ok; \ + ENTER; \ + save_re_context(); \ + ok=CAT2(is_utf8_,CLASS)((const U8*)STR); \ + assert(ok); \ + LEAVE; \ + } \ + if (!(OP(scan) == NAME \ ? cBOOL(swash_fetch(CAT2(PL_utf8_,CLASS), (U8*)locinput, utf8_target)) \ - : LCFUNC_utf8((U8*)locinput))) \ - { \ - sayNO; \ - } \ - locinput += PL_utf8skip[nextchr]; \ - nextchr = UCHARAT(locinput); \ - break; \ - } \ - /* Finished up by calling macro */ - -#define CCC_TRY_AFF(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC,LCFUNC) \ - _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ - if (!(OP(scan) == NAME ? FUNC(nextchr) : LCFUNC(nextchr))) \ - sayNO; \ - nextchr = UCHARAT(++locinput); \ + : LCFUNC_utf8((U8*)locinput))) \ + { \ + sayNO; \ + } \ + locinput += PL_utf8skip[nextchr]; \ + nextchr = UCHARAT(locinput); \ + break; \ + } \ + /* Finished up by macro calling this one */ + +#define CCC_TRY_AFF(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC,LCFUNC) \ + _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ + if (!(OP(scan) == NAME ? FUNC(nextchr) : LCFUNC(nextchr))) \ + sayNO; \ + nextchr = UCHARAT(++locinput); \ break /* Almost identical to the above, but has a case for a node that matches chars * between 128 and 255 using Unicode (latin1) semantics. */ #define CCC_TRY_AFF_U(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNCU,LCFUNC) \ - _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ + _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ if (!(OP(scan) == NAMEL ? 
LCFUNC(nextchr) : (FUNCU(nextchr) && (isASCII(nextchr) || (FLAGS(scan) & USE_UNI))))) \ - sayNO; \ - nextchr = UCHARAT(++locinput); \ + sayNO; \ + nextchr = UCHARAT(++locinput); \ break -#define _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ - case NAMEL: \ - PL_reg_flags |= RF_tainted; \ - /* FALL THROUGH */ \ - case NAME : \ - if (!nextchr && locinput >= PL_regeol) \ - sayNO; \ - if (utf8_target && UTF8_IS_CONTINUED(nextchr)) { \ - if (!CAT2(PL_utf8_,CLASS)) { \ - bool ok; \ - ENTER; \ - save_re_context(); \ - ok=CAT2(is_utf8_,CLASS)((const U8*)STR); \ - assert(ok); \ - LEAVE; \ - } \ - if ((OP(scan) == NAME \ +#define _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ + case NAMEL: \ + PL_reg_flags |= RF_tainted; \ + /* FALL THROUGH */ \ + case NAME : \ + if (!nextchr && locinput >= PL_regeol) \ + sayNO; \ + if (utf8_target && UTF8_IS_CONTINUED(nextchr)) { \ + if (!CAT2(PL_utf8_,CLASS)) { \ + bool ok; \ + ENTER; \ + save_re_context(); \ + ok=CAT2(is_utf8_,CLASS)((const U8*)STR); \ + assert(ok); \ + LEAVE; \ + } \ + if ((OP(scan) == NAME \ ? cBOOL(swash_fetch(CAT2(PL_utf8_,CLASS), (U8*)locinput, utf8_target)) \ - : LCFUNC_utf8((U8*)locinput))) \ - { \ - sayNO; \ - } \ - locinput += PL_utf8skip[nextchr]; \ - nextchr = UCHARAT(locinput); \ - break; \ + : LCFUNC_utf8((U8*)locinput))) \ + { \ + sayNO; \ + } \ + locinput += PL_utf8skip[nextchr]; \ + nextchr = UCHARAT(locinput); \ + break; \ } -#define CCC_TRY_NEG(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC,LCFUNC) \ - _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ - if ((OP(scan) == NAME ? FUNC(nextchr) : LCFUNC(nextchr))) \ - sayNO; \ - nextchr = UCHARAT(++locinput); \ +#define CCC_TRY_NEG(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC,LCFUNC) \ + _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ + if ((OP(scan) == NAME ? FUNC(nextchr) : LCFUNC(nextchr))) \ + sayNO; \ + nextchr = UCHARAT(++locinput); \ break -#define CCC_TRY_NEG_U(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNCU,LCFUNC) \ - _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNCU) \ +#define CCC_TRY_NEG_U(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNCU,LCFUNC) \ + _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNCU) \ if ((OP(scan) == NAMEL ? LCFUNC(nextchr) : (FUNCU(nextchr) && (isASCII(nextchr) || (FLAGS(scan) & USE_UNI))))) \ - sayNO; \ - nextchr = UCHARAT(++locinput); \ + sayNO; \ + nextchr = UCHARAT(++locinput); \ break -- 1.5.6.3
From cc879eb3641e1205d5a3df06585e034729a57bf4 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Sat, 14 Aug 2010 18:25:14 -0600 Subject: [PATCH] regcomp.h: Add macro to retrieve regnode flags --- regcomp.h | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/regcomp.h b/regcomp.h index 1ef9d2d..f40e5f2 100644 --- a/regcomp.h +++ b/regcomp.h @@ -271,6 +271,8 @@ struct regnode_charclass_class { /* has [[:blah:]] classes */ #undef STRING #define OP(p) ((p)->type) +#define FLAGS(p) ((p)->flags) /* Caution: Doesn't apply to all + regnode types */ #define OPERAND(p) (((struct regnode_string *)p)->string) #define MASK(p) ((char*)OPERAND(p)) #define STR_LEN(p) (((struct regnode_string *)p)->str_len) -- 1.5.6.3
From f8ac7be1a7cec7b860cd29af6cc0ff6307a7bf27 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Sat, 14 Aug 2010 18:28:33 -0600 Subject: [PATCH] feature/unicode_strings.t: rmv trail blank --- lib/feature/unicode_strings.t | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/feature/unicode_strings.t b/lib/feature/unicode_strings.t index dce34bd..3dfb0cf 100644 --- a/lib/feature/unicode_strings.t +++ b/lib/feature/unicode_strings.t @@ -26,7 +26,7 @@ my @posix_to_lower = my @latin1_to_title = @posix_to_upper; -# Override the elements in the to_lower arrays that have different lower case +# Override the elements in the to_lower arrays that have different lower case # mappings for my $i (0x41 .. 0x5A) { $posix_to_lower[$i] = chr(ord($posix_to_lower[$i]) + 32); @@ -84,7 +84,7 @@ for my $prefix (\%empty, \%posix, \%cyrillic, \%latin1) { my $cp = sprintf "U+%04X", $i; # First try using latin1 (Unicode) semantics. - use feature "unicode_strings"; + use feature "unicode_strings"; my $phrase = 'with uni8bit'; my $char = chr($i); -- 1.5.6.3
From 7326aef6528c8777477adfe085cf47ef398f4756 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Sat, 14 Aug 2010 18:31:07 -0600 Subject: [PATCH] lib/feature/unicode_strings.t: Imprv test output This improves the phrasing of the output of the tests --- lib/feature/unicode_strings.t | 18 +++++++++--------- 1 files changed, 9 insertions(+), 9 deletions(-) diff --git a/lib/feature/unicode_strings.t b/lib/feature/unicode_strings.t index 3dfb0cf..08785dc 100644 --- a/lib/feature/unicode_strings.t +++ b/lib/feature/unicode_strings.t @@ -86,7 +86,7 @@ for my $prefix (\%empty, \%posix, \%cyrillic, \%latin1) { # First try using latin1 (Unicode) semantics. use feature "unicode_strings"; - my $phrase = 'with uni8bit'; + my $phrase = 'in uni8bit'; my $char = chr($i); my $pre_lc = $prefix->{'lc'}; my $pre_uc = $prefix->{'uc'}; @@ -98,17 +98,17 @@ for my $prefix (\%empty, \%posix, \%cyrillic, \%latin1) { my $expected_lower = $pre_lc . $latin1_to_lower[$i] . $post_lc; is (uc($to_upper), $expected_upper, - display("$cp: $phrase: uc($to_upper) eq $expected_upper")); + display("$cp: $phrase: Verify uc($to_upper) eq $expected_upper")); is (lc($to_lower), $expected_lower, - display("$cp: $phrase: lc($to_lower) eq $expected_lower")); + display("$cp: $phrase: Verify lc($to_lower) eq $expected_lower")); if ($pre_uc eq "") { # Title case if null prefix. my $expected_title = $latin1_to_title[$i] . $post_lc; is (ucfirst($to_upper), $expected_title, - display("$cp: $phrase: ucfirst($to_upper) eq $expected_title")); + display("$cp: $phrase: Verify ucfirst($to_upper) eq $expected_title")); my $expected_lcfirst = $latin1_to_lower[$i] . $post_uc; is (lcfirst($to_lower), $expected_lcfirst, - display("$cp: $phrase: lcfirst($to_lower) eq $expected_lcfirst")); + display("$cp: $phrase: Verify lcfirst($to_lower) eq $expected_lcfirst")); } # Then try with posix semantics. @@ -125,17 +125,17 @@ for my $prefix (\%empty, \%posix, \%cyrillic, \%latin1) { $expected_lower = $pre_lc . $posix_to_lower[$i] . $post_lc; is (uc($to_upper), $expected_upper, - display("$cp: $phrase: uc($to_upper) eq $expected_upper")); + display("$cp: $phrase: Verify uc($to_upper) eq $expected_upper")); is (lc($to_lower), $expected_lower, - display("$cp: $phrase: lc($to_lower) eq $expected_lower")); + display("$cp: $phrase: Verify lc($to_lower) eq $expected_lower")); if ($pre_uc eq "") { my $expected_title = $posix_to_title[$i] . $post_lc; is (ucfirst($to_upper), $expected_title, - display("$cp: $phrase: ucfirst($to_upper) eq $expected_title")); + display("$cp: $phrase: Verify ucfirst($to_upper) eq $expected_title")); my $expected_lcfirst = $posix_to_lower[$i] . $post_uc; is (lcfirst($to_lower), $expected_lcfirst, - display("$cp: $phrase: lcfirst($to_lower) eq $expected_lcfirst")); + display("$cp: $phrase: Verify lcfirst($to_lower) eq $expected_lcfirst")); } } } -- 1.5.6.3
From 7f8c928b5c089b66f92810e6cb55ce128e9471f1 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Sat, 14 Aug 2010 21:51:45 -0600 Subject: [PATCH] regcomp.c: convert to use cBOOL() --- regcomp.c | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/regcomp.c b/regcomp.c index 04fc038..5d04a05 100644 --- a/regcomp.c +++ b/regcomp.c @@ -358,9 +358,9 @@ static const scan_data_t zero_scan_data = #define SCF_TRIE_RESTUDY 0x4000 /* Do restudy? */ #define SCF_SEEN_ACCEPT 0x8000 -#define UTF (RExC_utf8 != 0) -#define LOC ((RExC_flags & RXf_PMf_LOCALE) != 0) -#define FOLD ((RExC_flags & RXf_PMf_FOLD) != 0) +#define UTF cBOOL(RExC_utf8) +#define LOC cBOOL(RExC_flags & RXf_PMf_LOCALE) +#define FOLD cBOOL(RExC_flags & RXf_PMf_FOLD) #define OOB_UNICODE 12345678 #define OOB_NAMEDCLASS -1 -- 1.5.6.3

Message body is not shown because it is too large.

Subject: PATCH: [perl #58182] partial: Add /l, /u, /d regex modifiers (but infix notation only)
Date: Tue, 21 Sep 2010 15:29:12 -0600
To: Perl5 Porters <perl5-porters [...] perl.org>
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 903b
Also available at git://github.com/khwilliamson/perl.git
branch remods

The attached sequence of patches adds (?dlu:...) regex modifiers, which
we have discussed at great length on this list. There's also a little
comment/doc cleanup for the (?^...) patch. Thanks to that patch, no
tests that look at stringification need to change to accommodate these.

The (?u...) modifier is recognized by this patch, but it actually
doesn't do anything (except turn off the other modifiers) yet. When this
patch, revised as necessary, is incorporated, I'll rebase and resubmit
the patch that starts to use it.

I have a concern that something needs to be done with:

    dist/B-Deparse/Deparse.pm
    dump.c
    ext/B/t/concise-xs.t

that I haven't done, as code was changed in those for the /r modifier
added this summer. However, I don't see what that should be, and my
experiments don't show anything necessary.
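For illustration, how the infix modifiers are meant to be used once (?u...) does something; the behaviour shown is what later perls implement, since with only this series applied the modifier is recognized but inert:

    use strict; use warnings;

    my $str = "\xE9";    # LATIN SMALL LETTER E WITH ACUTE, as a byte string

    # Default (d) rules: \w on a non-UTF-8 string is ASCII-only
    print "(?d:\\w) ", $str =~ /(?d:\w)/ ? "matches" : "no match", "\n";

    # Unicode (u) rules: \w uses Latin-1/Unicode semantics
    print "(?u:\\w) ", $str =~ /(?u:\w)/ ? "matches" : "no match", "\n";

    # Stringification covers the defaults with the caret notation:
    print qr/abc/i, "\n";    # e.g. (?^i:abc) on perls with the (?^...) change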
From e660733c47d406bca8a1b1b42cb8daa1791672eb Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Mon, 20 Sep 2010 18:26:33 -0600 Subject: [PATCH] Change to use mnemonic instead of char constant The new '^' in (?^...) should really be a macro. --- regcomp.c | 5 +++-- regexp.h | 1 + 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/regcomp.c b/regcomp.c index 2871e4a..bd43d5d 100644 --- a/regcomp.c +++ b/regcomp.c @@ -4440,7 +4440,7 @@ Perl_re_compile(pTHX_ SV * const pattern, U32 pm_flags) SvFLAGS(rx) |= SvUTF8(pattern); *p++='('; *p++='?'; if (has_minus) { /* If a default, cover it using the caret */ - *p++='^'; + *p++= DEFAULT_PAT_MOD; } if (has_p) *p++ = KEEPCOPY_PAT_MOD; /*'p'*/ @@ -6118,7 +6118,8 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) RExC_parse--; /* for vFAIL to print correctly */ vFAIL("Sequence (? incomplete"); break; - case '^': /* Use default flags with the exceptions that follow */ + case DEFAULT_PAT_MOD: /* Use default flags with the exceptions + that follow */ has_use_defaults = TRUE; STD_PMMOD_FLAGS_CLEAR(&RExC_flags); goto parse_flags; diff --git a/regexp.h b/regexp.h index 198b510..17f9983 100644 --- a/regexp.h +++ b/regexp.h @@ -247,6 +247,7 @@ and check for NULL. * for compatibility reasons with Regexp::Common which highjacked (?k:...) * for its own uses. So 'k' is out as well. */ +#define DEFAULT_PAT_MOD '^' /* Short for all the default modifiers */ #define EXEC_PAT_MOD 'e' #define KEEPCOPY_PAT_MOD 'p' #define ONCE_PAT_MOD 'o' -- 1.5.6.3
From 28053908e543fe5063a185c9547ef13ed6528563 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Mon, 20 Sep 2010 18:29:59 -0600 Subject: [PATCH] Change comments, documentation for new (?^...) I overlooked these earlier in adding the caret notation. --- ext/B/t/OptreeCheck.pm | 6 +++--- pod/perlreapi.pod | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/ext/B/t/OptreeCheck.pm b/ext/B/t/OptreeCheck.pm index ce2482d..a85c5fc 100644 --- a/ext/B/t/OptreeCheck.pm +++ b/ext/B/t/OptreeCheck.pm @@ -102,11 +102,11 @@ various modes. # 7 <1> leavesub\[\d+ refs?\] K/REFC,1 # $)/ # got: '2 <#> gvsv[*b] s' - # want: (?-xism:2 <\$> gvsv\(\*b\) s) + # want: (?^:2 <\$> gvsv\(\*b\) s) # got: '3 <$> const[IV 42] s' - # want: (?-xism:3 <\$> const\(IV 42\) s) + # want: (?^:3 <\$> const\(IV 42\) s) # got: '5 <#> gvsv[*a] s' - # want: (?-xism:5 <\$> gvsv\(\*a\) s) + # want: (?^:5 <\$> gvsv\(\*a\) s) # remainder: # 2 <#> gvsv[*b] s # 3 <$> const[IV 42] s diff --git a/pod/perlreapi.pod b/pod/perlreapi.pod index 7dc9645..cc76502 100644 --- a/pod/perlreapi.pod +++ b/pod/perlreapi.pod @@ -655,7 +655,7 @@ Used during execution phase for managing search and replace patterns. =head2 C<wrapped> C<wraplen> Stores the string C<qr//> stringifies to. The perl engine for example -stores C<(?-xism:eek)> in the case of C<qr/eek/>. +stores C<(?^:eek)> in the case of C<qr/eek/>. When using a custom engine that doesn't support the C<(?:)> construct for inline modifiers, it's probably best to have C<qr//> stringify to -- 1.5.6.3
From 633a45e58260fa6078f8a6a2f360a2b5442a6d07 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Mon, 20 Sep 2010 18:31:00 -0600 Subject: [PATCH] Change .t to use new (?^...) There is a line in this .t which uses the old regex stringification. The contents are not currently tested for, but for cleanliness, change to the new. --- ext/B/t/optree_constants.t | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/ext/B/t/optree_constants.t b/ext/B/t/optree_constants.t index 47afea4..f293228 100644 --- a/ext/B/t/optree_constants.t +++ b/ext/B/t/optree_constants.t @@ -54,7 +54,7 @@ my $want = { # expected types, how value renders in-line, todos (maybe) myfl => [ 'NV', myfl ], myint => [ 'IV', myint ], $] >= 5.011 ? ( - myrex => [ $RV_class, '\\\\"\\(?-xism:Foo\\)"' ], + myrex => [ $RV_class, '\\\\"\\(?^:Foo\\)"' ], ) : ( myrex => [ $RV_class, '\\\\' ], ), -- 1.5.6.3
From cb702cc132f71445c26913318e170abd55cc28fd Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Mon, 20 Sep 2010 18:57:24 -0600 Subject: [PATCH] Allocate bit for /u modifier make regen required This allocates an unused bit in the structures. The offsets change so as to not disturb other bits. --- op.h | 2 +- op_reg_common.h | 3 ++- regexp.h | 2 +- regnodes.h | 4 ++-- 4 files changed, 6 insertions(+), 5 deletions(-) diff --git a/op.h b/op.h index 2ffd3e6..05aa652 100644 --- a/op.h +++ b/op.h @@ -362,7 +362,7 @@ struct pmop { /* Leave some space, so future bit allocations can go either in the shared or * unshared area without affecting binary compatibility */ -#define PMf_BASE_SHIFT (_RXf_PMf_SHIFT_NEXT+8) +#define PMf_BASE_SHIFT (_RXf_PMf_SHIFT_NEXT+7) /* taint $1 etc. if target tainted */ #define PMf_RETAINT (1<<(PMf_BASE_SHIFT+0)) diff --git a/op_reg_common.h b/op_reg_common.h index d4e3987..92230c8 100644 --- a/op_reg_common.h +++ b/op_reg_common.h @@ -29,10 +29,11 @@ #define RXf_PMf_EXTENDED (1 << (RXf_PMf_STD_PMMOD_SHIFT+3)) /* /x */ #define RXf_PMf_KEEPCOPY (1 << (RXf_PMf_STD_PMMOD_SHIFT+4)) /* /p */ #define RXf_PMf_LOCALE (1 << (RXf_PMf_STD_PMMOD_SHIFT+5)) +#define RXf_PMf_UNICODE (1 << (RXf_PMf_STD_PMMOD_SHIFT+6)) /* Next available bit after the above. Name begins with '_' so won't be * exported by B */ -#define _RXf_PMf_SHIFT_NEXT (RXf_PMf_STD_PMMOD_SHIFT+6) +#define _RXf_PMf_SHIFT_NEXT (RXf_PMf_STD_PMMOD_SHIFT+7) /* Mask of the above bits. These need to be transferred from op_pmflags to * re->extflags during compilation */ diff --git a/regexp.h b/regexp.h index 17f9983..3900ee1 100644 --- a/regexp.h +++ b/regexp.h @@ -288,7 +288,7 @@ and check for NULL. /* Leave some space, so future bit allocations can go either in the shared or * unshared area without affecting binary compatibility */ -#define RXf_BASE_SHIFT (_RXf_PMf_SHIFT_NEXT+3) +#define RXf_BASE_SHIFT (_RXf_PMf_SHIFT_NEXT+2) /* Anchor and GPOS related stuff */ #define RXf_ANCH_BOL (1<<(RXf_BASE_SHIFT+0)) diff --git a/regnodes.h b/regnodes.h index d132013..f5aacc2 100644 --- a/regnodes.h +++ b/regnodes.h @@ -625,14 +625,14 @@ EXTCONST char * const PL_reg_name[] = { EXTCONST char * PL_reg_extflags_name[]; #else EXTCONST char * const PL_reg_extflags_name[] = { - /* Bits in extflags defined: 11111111111111111111111000111111 */ + /* Bits in extflags defined: 11111111111111111111111001111111 */ "MULTILINE", /* 0x00000001 */ "SINGLELINE", /* 0x00000002 */ "FOLD", /* 0x00000004 */ "EXTENDED", /* 0x00000008 */ "KEEPCOPY", /* 0x00000010 */ "LOCALE", /* 0x00000020 */ - "UNUSED_BIT_6", /* 0x00000040 */ + "UNICODE", /* 0x00000040 */ "UNUSED_BIT_7", /* 0x00000080 */ "UNUSED_BIT_8", /* 0x00000100 */ "ANCH_BOL", /* 0x00000200 */ -- 1.5.6.3
From fed0a9e5099e1a42c9fe12d2fe6cba40659e3a44 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Mon, 20 Sep 2010 19:16:24 -0600 Subject: [PATCH] re.pm: Change comment to use new (?^...) --- ext/re/re.pm | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/ext/re/re.pm b/ext/re/re.pm index 9341feb..2d6784a 100644 --- a/ext/re/re.pm +++ b/ext/re/re.pm @@ -415,7 +415,7 @@ C<qr//> with the same pattern inside. If the argument is not a compiled reference then this routine returns false but defined in scalar context, and the empty list in list context. Thus the following - if (regexp_pattern($ref) eq '(?i-xsm:foo)') + if (regexp_pattern($ref) eq '(?^i:foo)') will be warning free regardless of what $ref actually is. -- 1.5.6.3

Message body is not shown because it is too large.

From 413fb85605639cdd237916c3757f6b0006848b6f Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Tue, 21 Sep 2010 15:09:12 -0600 Subject: [PATCH] regcomp.c: Convert some things to use cBOOL() --- regcomp.c | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/regcomp.c b/regcomp.c index 0bd30ab..d0c88d7 100644 --- a/regcomp.c +++ b/regcomp.c @@ -366,10 +366,10 @@ static const scan_data_t zero_scan_data = #define SCF_TRIE_RESTUDY 0x4000 /* Do restudy? */ #define SCF_SEEN_ACCEPT 0x8000 -#define UTF (RExC_utf8 != 0) -#define LOC ((RExC_flags & RXf_PMf_LOCALE) != 0) -#define UNI_SEMANTICS ((RExC_flags & RXf_PMf_UNICODE) != 0) -#define FOLD ((RExC_flags & RXf_PMf_FOLD) != 0) +#define UTF cBOOL(RExC_utf8) +#define LOC cBOOL(RExC_flags & RXf_PMf_LOCALE) +#define UNI_SEMANTICS cBOOL(RExC_flags & RXf_PMf_UNICODE) +#define FOLD cBOOL(RExC_flags & RXf_PMf_FOLD) #define OOB_UNICODE 12345678 #define OOB_NAMEDCLASS -1 -- 1.5.6.3
RT-Send-CC: perl5-porters [...] perl.org
Download (untitled) / with headers
text/plain 1.3k
On Tue Sep 21 14:30:01 2010, public@khwilliamson.com wrote:
> Also available at git://github.com/khwilliamson/perl.git
> branch remods
>
> The attached sequence of patches adds (?dlu:...) regex modifiers, which
> we have discussed at great length on this list. There's also a little
> comment/doc cleanup for the (?^...) patch. Thanks to that patch, no
> tests that look at stringification need to change to accommodate these.
>
> The (?u...) modifier is recognized by this patch, but it actually
> doesn't do anything (except turn off the other modifiers) yet. When
> this patch, revised as necessary, is incorporated, I'll rebase and
> resubmit the patch that starts to use it.
>
> I have a concern that something needs to be done with:
>     dist/B-Deparse/Deparse.pm
>     dump.c
>     ext/B/t/concise-xs.t
>
> that I haven't done, as code was changed in those for the /r modifier
> added this summer. However, I don't see what that should be, and my
> experiments don't show anything necessary.
Thank you. These have been applied as:

    855088127b85a7b03f3833b2274d4f26946f203d (1)
    ed215d3cfa99c10e51ffe780098978517bd67537 (2)
    4c2c679ff9fc18054795b9b7b28e37453e57d146 (3)
    9de15fec376a8ff90a38fad0ff322c72c2995765 (4 & 6)
    dff5e0c4913cbbc4e73cece5cd747e7aa222db67 (5)
    43fead97d9090f614849cdd8195a6900ee682952 (7)

Four and six I combined, as make failed with just number four applied.
Subject: PATCH [perl #58182] partial: Unicode semantics for \s, \w, \b
Date: Fri, 24 Sep 2010 00:03:48 -0600
To: perlbug [...] perl.org
From: karl williamson <public [...] khwilliamson.com>
Download (untitled) / with headers
text/plain 748b
The attached series of commits adds Unicode semantics for \s, \b, and \w
under the scope of feature unicode_strings, or with the /u regex
modifier. It contains changes to handy.h that are also done by
[perl #78022], so they should be skipped if that patch has been applied
already. It is also available at git://github.com/khwilliamson/perl.git,
branch remods.

The original version of this patch is 9 months old, and was not applied
to blead because the regex modifier /u was not available, which it
finally is. This partially fixes the "Unicode Bug", which is in
perltodo.pod. I have patches in various states of preparation that
extend this to fix the entire Unicode bug, but this is a necessary next
step, modified as review requires.
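As a sketch of the /u side of this (assuming a perl where the /u modifier exists and this series, or a later equivalent, is applied):

    use strict; use warnings;

    my $word = "na\xEFve";    # "naive" with I DIAERESIS, stored as Latin-1 bytes

    # Default semantics: \xEF is not a word character, so \w+ stops early
    my ($d) = $word =~ /(\w+)/;
    print "default: got '", $d // '', "'\n";    # typically just "na"

    # /u: Latin-1 characters get their Unicode classification
    my ($u) = $word =~ /(\w+)/u;
    print "/u:      got '", $u // '', "'\n";    # expect the whole word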
From 463802da5a8bace05c58371955c2baabffa833c2 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Thu, 23 Sep 2010 07:13:51 -0600 Subject: [PATCH] handy.h: Add isSPACE_L1() with Unicode semantics --- handy.h | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/handy.h b/handy.h index b41c1c8..7bacab3 100644 --- a/handy.h +++ b/handy.h @@ -515,6 +515,8 @@ patched there. The file as of this writing is cpan/Devel-PPPort/parts/inc/misc #define isALPHA(c) (isUPPER(c) || isLOWER(c)) /* ALPHAU includes Unicode semantics for latin1 characters. It has an extra * >= AA test to speed up ASCII-only tests at the expense of the others */ +/* XXX decide whether to document the ALPHAU, ALNUMU and isSPACE_L1 functions. + * Most of these should be implemented as table lookup for speed */ #define isALPHAU(c) (isALPHA(c) || (NATIVE_TO_UNI((U8) c) >= 0xAA \ && ((NATIVE_TO_UNI((U8) c) >= 0xC0 \ && NATIVE_TO_UNI((U8) c) != 0xD7 && NATIVE_TO_UNI((U8) c) != 0xF7) \ @@ -527,6 +529,8 @@ patched there. The file as of this writing is cpan/Devel-PPPort/parts/inc/misc #define isCHARNAME_CONT(c) (isALNUMU(c) || (c) == ' ' || (c) == '-' || (c) == '(' || (c) == ')' || (c) == ':' || NATIVE_TO_UNI((U8) c) == 0xA0) #define isSPACE(c) \ ((c) == ' ' || (c) == '\t' || (c) == '\n' || (c) =='\r' || (c) == '\f') +#define isSPACE_L1(c) (isSPACE(c) \ + || (NATIVE_TO_UNI(c) == 0x85 || NATIVE_TO_UNI(c) == 0xA0)) #define isPSXSPC(c) (isSPACE(c) || (c) == '\v') #define isBLANK(c) ((c) == ' ' || (c) == '\t') #define isDIGIT(c) ((c) >= '0' && (c) <= '9') -- 1.5.6.3
From 34680bb653a411eb819db8e36dfaa300bc9a3b40 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Thu, 23 Sep 2010 07:15:36 -0600 Subject: [PATCH] perlrebackslash: Fix poor grammar --- pod/perlrebackslash.pod | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod index d460f7f..91c4d7d 100644 --- a/pod/perlrebackslash.pod +++ b/pod/perlrebackslash.pod @@ -475,7 +475,7 @@ backslash sequences. =item \A C<\A> only matches at the beginning of the string. If the C</m> modifier -isn't used, then C</\A/> is equivalent with C</^/>. However, if the C</m> +isn't used, then C</\A/> is equivalent to C</^/>. However, if the C</m> modifier is used, then C</^/> matches internal newlines, but the meaning of C</\A/> isn't changed by the C</m> modifier. C<\A> matches at the beginning of the string regardless whether the C</m> modifier is used. @@ -483,7 +483,7 @@ of the string regardless whether the C</m> modifier is used. =item \z, \Z C<\z> and C<\Z> match at the end of the string. If the C</m> modifier isn't -used, then C</\Z/> is equivalent with C</$/>, that is, it matches at the +used, then C</\Z/> is equivalent to C</$/>, that is, it matches at the end of the string, or before the newline at the end of the string. If the C</m> modifier is used, then C</$/> matches at internal newlines, but the meaning of C</\Z/> isn't changed by the C</m> modifier. C<\Z> matches at -- 1.5.6.3
From f9f5fbe4510969ec71c9f1ded5f2e81de4a5e35d Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Thu, 23 Sep 2010 07:18:12 -0600 Subject: [PATCH] regcomp.c: Fix white space, cuddled else --- regcomp.c | 23 ++++++++++------------- 1 files changed, 10 insertions(+), 13 deletions(-) diff --git a/regcomp.c b/regcomp.c index ff9f87b..a865eac 100644 --- a/regcomp.c +++ b/regcomp.c @@ -881,7 +881,7 @@ S_cl_or(const RExC_state_t *pRExC_state, struct regnode_charclass_class *cl, con Dumps the final compressed table form of the trie to Perl_debug_log. Used for debugging make_trie(). */ - + STATIC void S_dump_trie(pTHX_ const struct _reg_trie_data *trie, HV *widecharmap, AV *revcharmap, U32 depth) @@ -3207,7 +3207,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, /* These are the cases when once a subexpression fails at a particular position, it cannot succeed even after backtracking at the enclosing scope. - + XXXX what if minimal match and we are at the initial run of {n,m}? */ if ((mincount != maxcount - 1) && (maxcount != REG_INFTY)) @@ -3348,7 +3348,6 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, #if 0 while ( nxt1 && (OP(nxt1) != WHILEM)) { regnode *nnxt = regnext(nxt1); - if (nnxt == nxt) { if (reg_off_by_arg[OP(nxt1)]) ARG_SET(nxt1, nxt2 - nxt1); @@ -3415,7 +3414,6 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, if (UTF) old = utf8_hop((U8*)s, old) - (U8*)s; - l -= old; /* Get the added string: */ last_str = newSVpvn_utf8(s + old, l, UTF); @@ -3503,13 +3501,13 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, if (flags & SCF_DO_STCLASS_AND) { for (value = 0; value < 256; value++) if (!is_VERTWS_cp(value)) - ANYOF_BITMAP_CLEAR(data->start_class, value); - } - else { + ANYOF_BITMAP_CLEAR(data->start_class, value); + } + else { for (value = 0; value < 256; value++) if (is_VERTWS_cp(value)) - ANYOF_BITMAP_SET(data->start_class, value); - } + ANYOF_BITMAP_SET(data->start_class, value); + } if (flags & SCF_DO_STCLASS_OR) cl_and(data->start_class, and_withp); flags &= ~SCF_DO_STCLASS; @@ -3522,7 +3520,6 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, data->pos_delta += 1; data->longest = &(data->longest_float); } - } else if (OP(scan) == FOLDCHAR) { int d = ARG(scan)==0xDF ? 1 : 2; @@ -3620,7 +3617,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, else { for (value = 0; value < 256; value++) if (!isALNUM(value)) - ANYOF_BITMAP_SET(data->start_class, value); + ANYOF_BITMAP_SET(data->start_class, value); } } break; @@ -3709,7 +3706,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, else { for (value = 0; value < 256; value++) if (isDIGIT(value)) - ANYOF_BITMAP_SET(data->start_class, value); + ANYOF_BITMAP_SET(data->start_class, value); } } break; @@ -3726,7 +3723,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, else { for (value = 0; value < 256; value++) if (!isDIGIT(value)) - ANYOF_BITMAP_SET(data->start_class, value); + ANYOF_BITMAP_SET(data->start_class, value); } } break; -- 1.5.6.3
From 4ab9de5b8b45bcde30a19f10053e7006c6674ab7 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Thu, 23 Sep 2010 07:49:37 -0600 Subject: [PATCH] Subject: [PATCH] regexec.c: add and refactor macros Add macros that will have unicode semantics; these share much code in common with ones that don't. So factor out that common code. These might be good candidates for inline functions. --- regexec.c | 32 +++++++++++++++++++++++++++----- 1 files changed, 27 insertions(+), 5 deletions(-) diff --git a/regexec.c b/regexec.c index 881f8c2..e26d134 100644 --- a/regexec.c +++ b/regexec.c @@ -176,11 +176,11 @@ #endif -#define CCC_TRY_AFF(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC,LCFUNC) \ - case NAMEL: \ +#define _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ + case NAMEL: \ PL_reg_flags |= RF_tainted; \ /* FALL THROUGH */ \ - case NAME: \ + case NAME: \ if (!nextchr) \ sayNO; \ if (utf8_target && UTF8_IS_CONTINUED(nextchr)) { \ @@ -202,12 +202,25 @@ nextchr = UCHARAT(locinput); \ break; \ } \ + /* Finished up by calling macro */ + +#define CCC_TRY_AFF(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC,LCFUNC) \ + _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ if (!(OP(scan) == NAME ? FUNC(nextchr) : LCFUNC(nextchr))) \ sayNO; \ nextchr = UCHARAT(++locinput); \ break -#define CCC_TRY_NEG(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC,LCFUNC) \ +/* Almost identical to the above, but has a case for a node that matches chars + * between 128 and 255 using Unicode (latin1) semantics. */ +#define CCC_TRY_AFF_U(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNCU,LCFUNC) \ + _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ + if (!(OP(scan) == NAMEL ? LCFUNC(nextchr) : (FUNCU(nextchr) && (isASCII(nextchr) || (FLAGS(scan) & USE_UNI))))) \ + sayNO; \ + nextchr = UCHARAT(++locinput); \ + break + +#define _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ case NAMEL: \ PL_reg_flags |= RF_tainted; \ /* FALL THROUGH */ \ @@ -232,13 +245,22 @@ locinput += PL_utf8skip[nextchr]; \ nextchr = UCHARAT(locinput); \ break; \ - } \ + } + +#define CCC_TRY_NEG(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC,LCFUNC) \ + _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ if ((OP(scan) == NAME ? FUNC(nextchr) : LCFUNC(nextchr))) \ sayNO; \ nextchr = UCHARAT(++locinput); \ break +#define CCC_TRY_NEG_U(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNCU,LCFUNC) \ + _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNCU) \ + if ((OP(scan) == NAMEL ? LCFUNC(nextchr) : (FUNCU(nextchr) && (isASCII(nextchr) || (FLAGS(scan) & USE_UNI))))) \ + sayNO; \ + nextchr = UCHARAT(++locinput); \ + break -- 1.5.6.3
From 19ff8586e34e821e2ee81c455782898b4feb8646 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Thu, 23 Sep 2010 07:51:09 -0600 Subject: [PATCH] Subject: [PATCH] regexec.c: make macros fit 80 cols Certain multi-line macros had their continuation backslashes way out. One line of each is longer than 80 chars, but no point in makeing all the lines that long. --- regexec.c | 136 ++++++++++++++++++++++++++++++------------------------------ 1 files changed, 68 insertions(+), 68 deletions(-) diff --git a/regexec.c b/regexec.c index e26d134..b8a8480 100644 --- a/regexec.c +++ b/regexec.c @@ -176,90 +176,90 @@ #endif -#define _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ - case NAMEL: \ - PL_reg_flags |= RF_tainted; \ - /* FALL THROUGH */ \ - case NAME: \ - if (!nextchr) \ - sayNO; \ - if (utf8_target && UTF8_IS_CONTINUED(nextchr)) { \ - if (!CAT2(PL_utf8_,CLASS)) { \ - bool ok; \ - ENTER; \ - save_re_context(); \ - ok=CAT2(is_utf8_,CLASS)((const U8*)STR); \ - assert(ok); \ - LEAVE; \ - } \ - if (!(OP(scan) == NAME \ +#define _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ + case NAMEL: \ + PL_reg_flags |= RF_tainted; \ + /* FALL THROUGH */ \ + case NAME: \ + if (!nextchr) \ + sayNO; \ + if (utf8_target && UTF8_IS_CONTINUED(nextchr)) { \ + if (!CAT2(PL_utf8_,CLASS)) { \ + bool ok; \ + ENTER; \ + save_re_context(); \ + ok=CAT2(is_utf8_,CLASS)((const U8*)STR); \ + assert(ok); \ + LEAVE; \ + } \ + if (!(OP(scan) == NAME \ ? cBOOL(swash_fetch(CAT2(PL_utf8_,CLASS), (U8*)locinput, utf8_target)) \ - : LCFUNC_utf8((U8*)locinput))) \ - { \ - sayNO; \ - } \ - locinput += PL_utf8skip[nextchr]; \ - nextchr = UCHARAT(locinput); \ - break; \ - } \ - /* Finished up by calling macro */ - -#define CCC_TRY_AFF(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC,LCFUNC) \ - _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ - if (!(OP(scan) == NAME ? FUNC(nextchr) : LCFUNC(nextchr))) \ - sayNO; \ - nextchr = UCHARAT(++locinput); \ + : LCFUNC_utf8((U8*)locinput))) \ + { \ + sayNO; \ + } \ + locinput += PL_utf8skip[nextchr]; \ + nextchr = UCHARAT(locinput); \ + break; \ + } \ + /* Finished up by macro calling this one */ + +#define CCC_TRY_AFF(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC,LCFUNC) \ + _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ + if (!(OP(scan) == NAME ? FUNC(nextchr) : LCFUNC(nextchr))) \ + sayNO; \ + nextchr = UCHARAT(++locinput); \ break /* Almost identical to the above, but has a case for a node that matches chars * between 128 and 255 using Unicode (latin1) semantics. */ #define CCC_TRY_AFF_U(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNCU,LCFUNC) \ - _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ + _CCC_TRY_AFF_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ if (!(OP(scan) == NAMEL ? 
LCFUNC(nextchr) : (FUNCU(nextchr) && (isASCII(nextchr) || (FLAGS(scan) & USE_UNI))))) \ - sayNO; \ - nextchr = UCHARAT(++locinput); \ + sayNO; \ + nextchr = UCHARAT(++locinput); \ break -#define _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ - case NAMEL: \ - PL_reg_flags |= RF_tainted; \ - /* FALL THROUGH */ \ - case NAME : \ - if (!nextchr && locinput >= PL_regeol) \ - sayNO; \ - if (utf8_target && UTF8_IS_CONTINUED(nextchr)) { \ - if (!CAT2(PL_utf8_,CLASS)) { \ - bool ok; \ - ENTER; \ - save_re_context(); \ - ok=CAT2(is_utf8_,CLASS)((const U8*)STR); \ - assert(ok); \ - LEAVE; \ - } \ - if ((OP(scan) == NAME \ +#define _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ + case NAMEL: \ + PL_reg_flags |= RF_tainted; \ + /* FALL THROUGH */ \ + case NAME : \ + if (!nextchr && locinput >= PL_regeol) \ + sayNO; \ + if (utf8_target && UTF8_IS_CONTINUED(nextchr)) { \ + if (!CAT2(PL_utf8_,CLASS)) { \ + bool ok; \ + ENTER; \ + save_re_context(); \ + ok=CAT2(is_utf8_,CLASS)((const U8*)STR); \ + assert(ok); \ + LEAVE; \ + } \ + if ((OP(scan) == NAME \ ? cBOOL(swash_fetch(CAT2(PL_utf8_,CLASS), (U8*)locinput, utf8_target)) \ - : LCFUNC_utf8((U8*)locinput))) \ - { \ - sayNO; \ - } \ - locinput += PL_utf8skip[nextchr]; \ - nextchr = UCHARAT(locinput); \ - break; \ + : LCFUNC_utf8((U8*)locinput))) \ + { \ + sayNO; \ + } \ + locinput += PL_utf8skip[nextchr]; \ + nextchr = UCHARAT(locinput); \ + break; \ } -#define CCC_TRY_NEG(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC,LCFUNC) \ - _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ - if ((OP(scan) == NAME ? FUNC(nextchr) : LCFUNC(nextchr))) \ - sayNO; \ - nextchr = UCHARAT(++locinput); \ +#define CCC_TRY_NEG(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC,LCFUNC) \ + _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNC) \ + if ((OP(scan) == NAME ? FUNC(nextchr) : LCFUNC(nextchr))) \ + sayNO; \ + nextchr = UCHARAT(++locinput); \ break -#define CCC_TRY_NEG_U(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNCU,LCFUNC) \ - _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNCU) \ +#define CCC_TRY_NEG_U(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNCU,LCFUNC) \ + _CCC_TRY_NEG_COMMON(NAME,NAMEL,CLASS,STR,LCFUNC_utf8,FUNCU) \ if ((OP(scan) == NAMEL ? LCFUNC(nextchr) : (FUNCU(nextchr) && (isASCII(nextchr) || (FLAGS(scan) & USE_UNI))))) \ - sayNO; \ - nextchr = UCHARAT(++locinput); \ + sayNO; \ + nextchr = UCHARAT(++locinput); \ break -- 1.5.6.3
From ef5ccd5da7b4cd3878b45b812442c50ac0e1c67b Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Thu, 23 Sep 2010 07:53:10 -0600 Subject: [PATCH] Subject: regcomp.h: Add macro to get regnode flags --- regcomp.h | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/regcomp.h b/regcomp.h index 362a8ed..1fb0e51 100644 --- a/regcomp.h +++ b/regcomp.h @@ -271,6 +271,8 @@ struct regnode_charclass_class { /* has [[:blah:]] classes */ #undef STRING #define OP(p) ((p)->type) +#define FLAGS(p) ((p)->flags) /* Caution: Doesn't apply to all + regnode types */ #define OPERAND(p) (((struct regnode_string *)p)->string) #define MASK(p) ((char*)OPERAND(p)) #define STR_LEN(p) (((struct regnode_string *)p)->str_len) -- 1.5.6.3
From f8138f1fb7d5a77fb61bc477d7337fb7109a9a9c Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Thu, 23 Sep 2010 07:55:15 -0600 Subject: [PATCH] Subject: unicode_strings.t: rmv trail blanks --- lib/feature/unicode_strings.t | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/feature/unicode_strings.t b/lib/feature/unicode_strings.t index dce34bd..3dfb0cf 100644 --- a/lib/feature/unicode_strings.t +++ b/lib/feature/unicode_strings.t @@ -26,7 +26,7 @@ my @posix_to_lower = my @latin1_to_title = @posix_to_upper; -# Override the elements in the to_lower arrays that have different lower case +# Override the elements in the to_lower arrays that have different lower case # mappings for my $i (0x41 .. 0x5A) { $posix_to_lower[$i] = chr(ord($posix_to_lower[$i]) + 32); @@ -84,7 +84,7 @@ for my $prefix (\%empty, \%posix, \%cyrillic, \%latin1) { my $cp = sprintf "U+%04X", $i; # First try using latin1 (Unicode) semantics. - use feature "unicode_strings"; + use feature "unicode_strings"; my $phrase = 'with uni8bit'; my $char = chr($i); -- 1.5.6.3
From ac922ed45efd6da5bb07945522aadc048149be6b Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Thu, 23 Sep 2010 07:57:02 -0600 Subject: [PATCH] Subject: unicode_strings.t: Imprv test output This improves the phrasing of the output of the tests --- lib/feature/unicode_strings.t | 18 +++++++++--------- 1 files changed, 9 insertions(+), 9 deletions(-) diff --git a/lib/feature/unicode_strings.t b/lib/feature/unicode_strings.t index 3dfb0cf..08785dc 100644 --- a/lib/feature/unicode_strings.t +++ b/lib/feature/unicode_strings.t @@ -86,7 +86,7 @@ for my $prefix (\%empty, \%posix, \%cyrillic, \%latin1) { # First try using latin1 (Unicode) semantics. use feature "unicode_strings"; - my $phrase = 'with uni8bit'; + my $phrase = 'in uni8bit'; my $char = chr($i); my $pre_lc = $prefix->{'lc'}; my $pre_uc = $prefix->{'uc'}; @@ -98,17 +98,17 @@ for my $prefix (\%empty, \%posix, \%cyrillic, \%latin1) { my $expected_lower = $pre_lc . $latin1_to_lower[$i] . $post_lc; is (uc($to_upper), $expected_upper, - display("$cp: $phrase: uc($to_upper) eq $expected_upper")); + display("$cp: $phrase: Verify uc($to_upper) eq $expected_upper")); is (lc($to_lower), $expected_lower, - display("$cp: $phrase: lc($to_lower) eq $expected_lower")); + display("$cp: $phrase: Verify lc($to_lower) eq $expected_lower")); if ($pre_uc eq "") { # Title case if null prefix. my $expected_title = $latin1_to_title[$i] . $post_lc; is (ucfirst($to_upper), $expected_title, - display("$cp: $phrase: ucfirst($to_upper) eq $expected_title")); + display("$cp: $phrase: Verify ucfirst($to_upper) eq $expected_title")); my $expected_lcfirst = $latin1_to_lower[$i] . $post_uc; is (lcfirst($to_lower), $expected_lcfirst, - display("$cp: $phrase: lcfirst($to_lower) eq $expected_lcfirst")); + display("$cp: $phrase: Verify lcfirst($to_lower) eq $expected_lcfirst")); } # Then try with posix semantics. @@ -125,17 +125,17 @@ for my $prefix (\%empty, \%posix, \%cyrillic, \%latin1) { $expected_lower = $pre_lc . $posix_to_lower[$i] . $post_lc; is (uc($to_upper), $expected_upper, - display("$cp: $phrase: uc($to_upper) eq $expected_upper")); + display("$cp: $phrase: Verify uc($to_upper) eq $expected_upper")); is (lc($to_lower), $expected_lower, - display("$cp: $phrase: lc($to_lower) eq $expected_lower")); + display("$cp: $phrase: Verify lc($to_lower) eq $expected_lower")); if ($pre_uc eq "") { my $expected_title = $posix_to_title[$i] . $post_lc; is (ucfirst($to_upper), $expected_title, - display("$cp: $phrase: ucfirst($to_upper) eq $expected_title")); + display("$cp: $phrase: Verify ucfirst($to_upper) eq $expected_title")); my $expected_lcfirst = $posix_to_lower[$i] . $post_uc; is (lcfirst($to_lower), $expected_lcfirst, - display("$cp: $phrase: lcfirst($to_lower) eq $expected_lcfirst")); + display("$cp: $phrase: Verify lcfirst($to_lower) eq $expected_lcfirst")); } } } -- 1.5.6.3
From 4ec43c4e5f0c1c6381214ce4470a7ac6de6d17af Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Thu, 23 Sep 2010 09:35:53 -0600 Subject: [PATCH] perlre.pod: slight rewording --- pod/perlre.pod | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/pod/perlre.pod b/pod/perlre.pod index b9216c1..fd3bce6 100644 --- a/pod/perlre.pod +++ b/pod/perlre.pod @@ -662,10 +662,10 @@ pragma. Note that the C<d>, C<l>, C<p>, and C<u> modifiers are special in that they can only be enabled, not disabled, and the C<d>, C<l>, and C<u> -modifiers are mutually exclusive; a maximum of one may appear in the -construct. Specifying one de-specifies the others. Thus, for example, -C<(?-p)> and C<(?-d:...)> are meaningless and will warn when compiled -under C<use warnings>. +modifiers are mutually exclusive: specifying one de-specifies the +others, and a maximum of one may appear in the construct. Thus, for +example, C<(?-p)>, C<(?-d:...)>, and C<(?-dl:...)> will warn when +compiled under C<use warnings>. Note also that the C<p> modifier is special in that its presence anywhere in a pattern has a global effect. -- 1.5.6.3
From 7e4648eded2f8c7bff8a7b803861cb6b557543da Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Thu, 23 Sep 2010 12:46:32 -0600 Subject: [PATCH] Subject: handy.h: Add isWORDCHAR_L1() macro This is a synonym for isALNUMU --- handy.h | 1 + 1 files changed, 1 insertions(+), 0 d