\p{user-defined} should be immune from later Unicode releases #17025
Comments
From perlbug@jgreely.com
Created by perlbug@jgreely.com
This is a bug report for perl from perlbug@jgreely.com, -----------------------------------------------------------------
Example code:
5.20.2:
5.28.1:
Perl Info
|
From @jkeenan
On Mon, 27 May 2019 08:29:53 GMT, jgreely wrote:
Would it be possible for you to provide a patch to pod/perlunicode.pod in the Perl 5 core distribution so that it is clear exactly which example in that document needs revision? I couldn't locate the specific code example you provided in that document, even though I was able to confirm the difference in behavior between those two perl versions. Ideally, we would prefer a patch against perl 5 blead, i.e., against pod/perlunicode.pod in a git checkout of core. It would also be good to attach the patch rather than including it inline. Thank you very much. -- |
The RT System itself - Status changed from 'new' to 'open' |
From perlbug@jgreely.com
On Mon, 27 May 2019 07:18:59 -0700, jkeenan wrote:
All instances of the string 'InKana' must be replaced with a new name to work under recent Perl versions. I chose 'InHiraganaKatakana', which may be longer than desired for an example. See attachment. |
From perlbug@jgreely.com
inkana.patch
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 8f09a18fca..3e3a5f031c 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -1144,7 +1144,7 @@ hexadecimal code points for a range; or a single hexadecimal code point.
For example, to define a property that covers both the Japanese
syllabaries (hiragana and katakana), you can define
- sub InKana {
+ sub InHiraganaKatakana {
return <<END;
3040\t309F
30A0\t30FF
@@ -1152,11 +1152,11 @@ syllabaries (hiragana and katakana), you can define
}
Imagine that the here-doc end marker is at the beginning of the line.
-Now you can use C<\p{InKana}> and C<\P{InKana}>.
+Now you can use C<\p{InHiraganaKatakana}> and C<\P{InHiraganaKatakana}>.
You could also have used the existing block property names:
- sub InKana {
+ sub InHiraganaKatakana {
return <<'END';
+utf8::InHiragana
+utf8::InKatakana
@@ -1167,7 +1167,7 @@ Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the unassigned characters:
- sub InKana {
+ sub InHiraganaKatakana {
return <<'END';
+utf8::InHiragana
+utf8::InKatakana
@@ -1177,7 +1177,7 @@ the unassigned characters:
The negation is useful for defining (surprise!) negated classes.
- sub InNotKana {
+ sub InNotHiraganaKatakana {
return <<'END';
!utf8::InHiragana
-utf8::InKatakana
@@ -1186,10 +1186,10 @@ The negation is useful for defining (surprise!) negated classes.
}
This will match all non-Unicode code points, since every one of them is
-not in Kana. You can use intersection to exclude these, if desired, as
+not in HiraganaKatakana. You can use intersection to exclude these, if desired, as
this modified example shows:
- sub InNotKana {
+ sub InNotHiraganaKatakana {
return <<'END';
!utf8::InHiragana
-utf8::InKatakana
|
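For readers following the patch, here is a minimal sketch of the composed form it touches: a user-defined property built from the existing block properties, with unassigned code points subtracted. The sub name `InAssignedHiraganaKatakana` and the test characters are illustrative assumptions, not part of the patch. The `-utf8::IsCn` line is the subtraction syntax as I recall it from perlunicode, since the diff context above does not show that line.

```perl
use strict;
use warnings;

# Hypothetical property name (not from the patch). User-defined
# property subs must have names beginning with "In" or "Is".
sub InAssignedHiraganaKatakana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

my $assigned   = "\x{3042}";  # HIRAGANA LETTER A
my $unassigned = "\x{3040}";  # reserved code point at the start of the Hiragana block

my $assigned_matches   = $assigned   =~ /\p{InAssignedHiraganaKatakana}/ ? 1 : 0;
my $unassigned_matches = $unassigned =~ /\p{InAssignedHiraganaKatakana}/ ? 1 : 0;

print "assigned: $assigned_matches, unassigned: $unassigned_matches\n";
```

The sub is defined before the regexes that use it, so the timing question discussed in this ticket does not arise here.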
From @jkeenan
On Mon, 27 May 2019 15:47:57 GMT, jgreely wrote:
Patch looks good to me; available in jkeenan/rt-134146-unicode branch. TonyC, khw, list: Comments? Thank you very much. |
From @khwilliamson
On Mon, 27 May 2019 08:47:57 -0700, jgreely wrote:
Actually, this isn't true. This program shows that you can define your own InKana property
sub InKana {
qr/\p{InKana}/;
When run under blead with -Dr, we get
However, if instead the program is
sub InKana {
and run under blead -Dr, we get
Final program:
What is happening here is that perl looks for a user-defined property. If it finds one, it uses it. If it doesn't find it, it looks for an official Unicode property. If it finds one, it uses that. If not, it will defer looking up the property until execution. If it is still undefined at the point it is first needed, it croaks, otherwise it uses it.
My guess is that the program that led to the erroneous conclusion is like the 2nd program above.
Declaring
sub InKana;
I think the patch to the documentation would be to add something about this timing issue |
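The define-first case described above can be sketched roughly as follows. This is a minimal illustration, not the program from the comment (which did not survive in this thread). It assumes the sub lives in package main and appears in the file before any regex that names the property. The test strings and variable names are my own.

```perl
use strict;
use warnings;

# Defined before any pattern that mentions \p{InKana}, so the
# user-defined property is already known when the regex is compiled.
sub InKana {
    return <<"END";
3040\t309F
30A0\t30FF
END
}

my $kana  = "\x{3042}\x{30A2}";  # HIRAGANA LETTER A, KATAKANA LETTER A
my $latin = "A";

my $kana_matches  = $kana  =~ /^\p{InKana}+$/ ? 1 : 0;
my $latin_matches = $latin =~ /^\p{InKana}+$/ ? 1 : 0;

print "kana: $kana_matches, latin: $latin_matches\n";
```

Because the sub is visible when the pattern is compiled, the lookup order described above should settle on the user-defined ranges, whatever official property names the installed Unicode version provides.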
From perlbug@jgreely.com
My program (which worked from at least 5.10.1 through 5.20.2) did indeed reference "/^\p{InKana}+$/" before defining "sub InKana {...}". It feels very un-Perly to have to put a sub at the top of your script in order to get correct behavior, so I'd agree that the documentation needs an update. |
From @khwilliamson
On 5/27/19 2:48 PM, J Greely via RT wrote:
You have convinced me, without perhaps intending to, that this is a bug. |
From @khwilliamson
On Fri, 31 May 2019 21:41:45 -0700, public@khwilliamson.com wrote:
It turns out that this particular instance is symptomatic of another problem. The 'In' prefix is only supposed to be used for Block properties, but it was being accepted for all properties. This bug has apparently been there from the beginning of such things. That issue has been fixed by 74333e9, which means no documentation should be changed.
The bottom line is that later Unicode releases with their new property names should not override existing code that uses a particular user-defined property name. I have changed the title of this ticket accordingly.
And it turns out that there is a fairly simple solution that does this: never expand, until runtime, a property whose name could be a user-defined one. If at that time no appropriate sub has been defined, then look for an official property with that name. This means slower execution if and only if the property name begins with In or Is, and only the first time the match is tried.
I will defer fixing this for the time being. |
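A small probe for the use-before-define ordering that this ticket is about, written as a sketch under my own assumptions (test string, variable names, and the `eval` wrapper are not from the thread). It deliberately claims no expected output: whether the pattern sees the user-defined ranges or an official property depends on the perl version and on whether the runtime-deferral change described above is present.

```perl
use strict;
use warnings;

my $text = "\x{3042}\x{30A2}";  # HIRAGANA LETTER A, KATAKANA LETTER A

# The pattern appears in the file BEFORE sub InKana is defined.
# eval catches a runtime croak if the property cannot be resolved.
my $outcome = eval { $text =~ /^\p{InKana}+$/ ? "match" : "no match" }
    // "error: $@";

print "perl $]: $outcome\n";

# User-defined property, defined after its first textual use.
sub InKana {
    return "3040\t309F\n30A0\t30FF\n";
}
```

Running this under several perl builds shows which property definition each build picks up for this ordering.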
Prior to this patch, they only sometimes overrode.
Migrated from rt.perl.org#134146 (status was 'open')
Searchable as RT134146$