Report information
Id: 133547
Status: pending release
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: ph10 [at] hermes.cam.ac.uk
Cc:
AdminCc:

Operating System: (no value)
PatchStatus: (no value)
Severity: medium
Type: core
Perl Version: 5.28.0
Fixed In: (no value)



From: ph10 [...] hermes.cam.ac.uk
Date: Thu, 27 Sep 2018 17:48:25 +0100 (BST)
Subject: Inconsistency in Script Run
To: perlbug [...] perl.org
From: ph10@cam.ac.uk
To: perlbug@perl.org
Message-Id: <5.28.0_31268_1538066218@quercite>
Reply-To: ph10@cam.ac.uk
Cc: builduser
Subject: Script Run Consistency

This is a bug report for perl from ph10@cam.ac.uk,
generated with the help of perlbug 1.41 running under perl 5.28.0.

-----------------------------------------------------------------
[Please describe your issue here]

I was running some tests on the new (*script_run:...) regex feature,
preparatory to implementing it in PCRE. As I understand it from reading perlre,
the ASCII digits 0-9 should be acceptable in any script run, provided there
aren't any other digits. There seems to be some inconsistency. Consider these
two examples:

$ perl -e 'if ("\x{3041}12\x{3041}" =~ /^(*sr:.{4})/) { print "yes >$&<\n"; } else { print "no \n"; }'
yes >ぁ12ぁ<

In this example, the two ASCII digits "12" are flanked by two Hiragana
characters; the pattern matches. This is also true for many other scripts,
including Greek, Cyrillic, Armenian, Hebrew, Arabic, Ethiopic, and Ogham.

$ perl -e 'if ("\x{0980}12\x{0993}" =~ /^(*sr:.{4})/) { print "yes >$&<\n"; } else { print "no \n"; }'
no

In this example, the two ASCII digits "12" are flanked by two Bengali
characters; the pattern does not match. This is also true for Thaana, Thai,
Khmer and Devanagari.

Why the difference? I haven't exhaustively tested all possible scripts, and I
haven't spotted any pattern in which ones match and which ones don't.

Philip

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=medium
---
Site configuration information for perl 5.28.0:

Configured by builduser at Wed Aug 1 10:43:08 CEST 2018.
Summary of my perl5 (revision 5 version 28 subversion 0) configuration:

  Platform:
    osname=linux
    osvers=4.17.11-arch1
    archname=x86_64-linux-thread-multi
    uname='linux flo-64s 4.17.11-arch1 #1 smp preempt sun jul 29 10:11:16 utc 2018 x86_64 gnulinux '
    config_args='-des -Dusethreads -Duseshrplib -Doptimize=-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt -Dprefix=/usr -Dvendorprefix=/usr -Dprivlib=/usr/share/perl5/core_perl -Darchlib=/usr/lib/perl5/5.28/core_perl -Dsitelib=/usr/share/perl5/site_perl -Dsitearch=/usr/lib/perl5/5.28/site_perl -Dvendorlib=/usr/share/perl5/vendor_perl -Dvendorarch=/usr/lib/perl5/5.28/vendor_perl -Dscriptdir=/usr/bin/core_perl -Dsitescript=/usr/bin/site_perl -Dvendorscript=/usr/bin/vendor_perl -Dinc_version_list=none -Dman1ext=1perl -Dman3ext=3perl -Dcccdlflags='-fPIC' -Dlddlflags=-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -Dldflags=-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=define
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
    bincompat5005=undef
  Compiler:
    cc='cc'
    ccflags ='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    optimize='-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt'
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='8.1.1 20180531'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='cc'
    ldflags ='-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/x86_64-pc-linux-gnu/8.1.1/include-fixed /usr/lib /lib/../lib /usr/lib/../lib /lib /lib64 /usr/lib64
    libs=-lpthread -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.27.so
    so=so
    useshrplib=true
    libperl=libperl.so
    gnulibc_version='2.27'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.28/core_perl/CORE'
    cccdlflags='-fPIC'
    lddlflags='-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -L/usr/local/lib -fstack-protector-strong'

---
@INC for perl 5.28.0:
    /usr/lib/perl5/5.28/site_perl
    /usr/share/perl5/site_perl
    /usr/lib/perl5/5.28/vendor_perl
    /usr/share/perl5/vendor_perl
    /usr/lib/perl5/5.28/core_perl
    /usr/share/perl5/core_perl

---
Environment for perl 5.28.0:
    HOME=/home/ph10
    LANG=en_GB.utf8
    LANGUAGE=en_GB.utf8
    LC_ALL=C
    LC_COLLATE=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/ph10/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/android-sdk/platform-tools:/opt/android-sdk/tools:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/sbin:.:/opt/android-sdk/platform-tools:/opt/android-sdk/tools:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl
    PERL_BADLANG (unset)
    SHELL=/bin/bash
Date: Thu, 27 Sep 2018 19:51:08 +0200
Subject: Re: [perl #133547] Inconsistency in Script Run
To: perl5-porters [...] perl.org
From: Abigail <abigail [...] abigail.be>
On Thu, Sep 27, 2018 at 10:04:22AM -0700, Philip Hazel (via RT) wrote:
> Why the difference? I haven't exhaustively tested all possible scripts, and I
> haven't spotted any pattern in which ones match and which ones don't.

Can you check with blead?

I reported this in August, and Karl fixed that the same day. So 5.28.0
is broken, but blead should do things correctly.

Regards,
Abigail
From: ph10 [...] hermes.cam.ac.uk
Date: Fri, 28 Sep 2018 08:33:54 +0100 (BST)
Subject: Re: [perl #133547] Inconsistency in Script Run
To: Abigail via RT <perlbug-followup [...] perl.org>
On Thu, 27 Sep 2018, Abigail via RT wrote:

> Can you check with blead?

Not without some research and learning how to do that. :-) But if I get
some time I'll have a go.

> I reported this in August, and Karl fixed that the same day. So 5.28.0
> is broken, but blead should do things correctly.

I'm pleased to learn that it *is* a bug, and not some misunderstanding
on my part. Thanks for the fast response.

Regards,
Philip

--
Philip Hazel
RT-Send-CC: perl5-porters [...] perl.org
This is fixed by

commit 393e5a4585b92e635cfc4eee34da8f86f3bfd2af
Author: Karl Williamson <khw@cpan.org>
Date:   Sun Sep 30 10:38:02 2018 -0600

    PATCH: [perl #133547]: script run broken

    All scripts can have the ASCII digits for their numbers. Scripts with
    their own digits can alternatively use those. Only one of these two
    sets can be used in a script run. The decision as to which set to use
    must be deferred until the first digit is encountered, as otherwise we
    don't know which set will be used. Prior to this commit, the decision
    was being made prematurely in some cases. As a result of this change,
    the non-ASCII-digits in the Common script need to be special-cased, and
    different criteria are used to decide if we need to look up whether a
    character is a digit or not.

--
Karl Williamson
To: ph10 [...] hermes.cam.ac.uk, Abigail via RT <perlbug-followup [...] perl.org>
Date: Sun, 30 Sep 2018 11:22:01 -0600
Subject: Re: [perl #133547] Inconsistency in Script Run
From: Karl Williamson <public [...] khwilliamson.com>
On 09/28/2018 01:33 AM, ph10@hermes.cam.ac.uk wrote:
> On Thu, 27 Sep 2018, Abigail via RT wrote:
>
>> Can you check with blead?
>
> Not without some research and learning how to do that. :-) But if I get
> some time I'll have a go.
>
>> I reported this in August, and Karl fixed that the same day. So 5.28.0
>> is broken, but blead should do things correctly.
>
> I'm pleased to learn that it *is* a bug, and not some misunderstanding
> on my part. Thanks for the fast response.
>
> Regards,
> Philip
>

The fix for this should be put in 5.28.1.

perlre has been updated since 5.28.0 to make clearer the acceptable
behavior of a run. Hopefully, if you had read it, you wouldn't have
thought it was a misunderstanding:

https://perl5.git.perl.org/perl.git/commitdiff/4a1d964056983f26f5646fdb7aadb4b5e7b5235f
From: ph10 [...] hermes.cam.ac.uk
CC: Abigail via RT <perlbug-followup [...] perl.org>
Subject: Re: [perl #133547] Inconsistency in Script Run
Date: Mon, 1 Oct 2018 09:54:10 +0100 (BST)
To: Karl Williamson <public [...] khwilliamson.com>
On Sun, 30 Sep 2018, Karl Williamson wrote:

> perlre has been updated since 5.28.0 to make clearer the acceptable behavior
> of a run. Hopefully, if you had read it, you wouldn't have thought it was a
> misunderstanding:
>
> https://perl5.git.perl.org/perl.git/commitdiff/4a1d964056983f26f5646fdb7aadb4b5e7b5235f

Many thanks, Karl. That confirms what I had (finally :-) deduced myself,
and is admirably clear.

Regards,
Philip

--
Philip Hazel
To: Karl Williamson <public [...] khwilliamson.com>
Subject: Re: [perl #133547] Inconsistency in Script Run
Date: Tue, 2 Oct 2018 10:57:58 +0100 (BST)
From: ph10 [...] hermes.cam.ac.uk
CC: Abigail via RT <perlbug-followup [...] perl.org>
On Sun, 30 Sep 2018, Karl Williamson wrote:

> The fix for this should be put in 5.28.1.

I have downloaded v5.29.4 (v5.29.3-35-g4288c5b93b) and can confirm that
all the issues I previously reported are fixed. However, there are still
two oddities that don't seem to be right. The digit sequences FF10..FF19
and 1D7CE..1D7FF (both in the Common script) don't seem to work as I
expected them to. A string containing them along with Latin characters is
not valid as a script run in this testing Perl. Indeed, a string with
only one of them and Latin characters doesn't match (which it surely
should, regardless of being a digit, since it is in the Common script).
Two of them on their own, without any Latin characters, do match.

These strings match the pattern /^(*sr:.{4})/

  \x{ff10}\x{ff19}..
  \x{1d7ce}\x{1d7cf},,

These don't:

  A\x{ff10}\x{ff19}B
  A\x{ff10}BC
  A\x{1d7ce}\x{1d7cf}B
  A\x{1d7ce}BC

Regards,
Philip

--
Philip Hazel
From: Karl Williamson <public [...] khwilliamson.com>
CC: Abigail via RT <perlbug-followup [...] perl.org>
To: ph10 [...] hermes.cam.ac.uk
Subject: Re: [perl #133547] Inconsistency in Script Run
Date: Tue, 2 Oct 2018 08:08:21 -0600
On 10/02/2018 03:57 AM, ph10@hermes.cam.ac.uk wrote:
> On Sun, 30 Sep 2018, Karl Williamson wrote:
>
>> The fix for this should be put in 5.28.1.
>
> I have downloaded v5.29.4 (v5.29.3-35-g4288c5b93b) and can confirm that
> all the issues I previously reported are fixed. However, there are still
> two oddities that don't seem to be right. The digit sequences FF10..FF19
> and 1D7CE..1D7FF (both in the Common script) don't seem to work as I
> expected them to. A string containing them along with Latin characters is
> not valid as a script run in this testing Perl. Indeed, a string with
> only one of them and Latin characters doesn't match (which it surely
> should, regardless of being a digit, since it is in the Common script).
> Two of them on their own, without any Latin characters, do match.
>
> These strings match the pattern /^(*sr:.{4})/
>
>   \x{ff10}\x{ff19}..
>   \x{1d7ce}\x{1d7cf},,
>
> These don't:
>
>   A\x{ff10}\x{ff19}B
>   A\x{ff10}BC
>   A\x{1d7ce}\x{1d7cf}B
>   A\x{1d7ce}BC

Technically, this isn't a bug, but a design flaw. My design was for only
ASCII 0-9 to be allowed with other scripts. Your second batch of cases
here are in the Latin script, and therefore the only digits from Common
that are allowed are the ASCII ones.

But that is not what a reasonable person would expect, and so the design
is wrong.

I see two choices:

1) Allow the non-ASCII digits that are considered Common to match the
   Latin script
2) Allow these to match any script, just like the ASCII ones already do.

The second solution seems more in keeping with Unicode's intent, since
they made these digits Common, so they should be allowed in multiple
scripts. But the requirement that all digits in a run must come from the
same sequence of 10 would remain.

I'm open to hearing arguments either way, or some third way.
Date: Tue, 2 Oct 2018 16:12:29 +0100 (BST)
Subject: Re: [perl #133547] Inconsistency in Script Run
To: Karl Williamson <public [...] khwilliamson.com>
From: ph10 [...] hermes.cam.ac.uk
CC: Abigail via RT <perlbug-followup [...] perl.org>
On Sun, 30 Sep 2018, Karl Williamson wrote:

> perlre has been updated since 5.28.0 to make clearer the acceptable behavior
> of a run.

Sorry to nag you again, but have I got the following right? Perl allows
a Common or Inherited character in a script run only if its Script
Extension property lists the script of other characters in the run, or
if it doesn't figure in the Extensions file. Example: the longest script
run in "AB\x{1cf7}" is "AB", even though 1cf7 is a Common character.

Regards,
Philip

--
Philip Hazel
From: ph10 [...] hermes.cam.ac.uk
CC: Abigail via RT <perlbug-followup [...] perl.org>
Subject: Re: [perl #133547] Inconsistency in Script Run
Date: Tue, 2 Oct 2018 16:15:16 +0100 (BST)
To: Karl Williamson <public [...] khwilliamson.com>
On Tue, 2 Oct 2018, Karl Williamson wrote:

> Technically, this isn't a bug, but a design flaw.

Nice distinction! :-)

> 2) Allow these to match any script, just like the ASCII ones already do.

That is what I expected, and what I have tentatively implemented.

> The second solution seems more in keeping with Unicode's intent, since they
> made these digits Common, so should be allowed in multiple scripts. But the
> requirement that all digits in a run must come from the same sequence of 10
> would remain.

Yes, indeed.

Regards,
Philip

--
Philip Hazel
From: Karl Williamson <public [...] khwilliamson.com>
CC: Abigail via RT <perlbug-followup [...] perl.org>
To: ph10 [...] hermes.cam.ac.uk
Date: Tue, 2 Oct 2018 09:57:37 -0600
Subject: Re: [perl #133547] Inconsistency in Script Run
On 10/02/2018 09:12 AM, ph10@hermes.cam.ac.uk wrote:
> On Sun, 30 Sep 2018, Karl Williamson wrote:
>
>> perlre has been updated since 5.28.0 to make clearer the acceptable behavior
>> of a run.
>
> Sorry to nag you again, but have I got the following right? Perl allows
> a Common or Inherited character in a script run only if its Script
> Extension property lists the script of other characters in the run, or
> if it doesn't figure in the Extensions file. Example: the longest script
> run in "AB\x{1cf7}" is "AB", even though 1cf7 is a Common character.
>

1cf7 is not a Common character in the Script Extensions property, and so
yes the longest script run in that string is AB. The Script property is
irrelevant to script runs.
CC: Abigail via RT <perlbug-followup [...] perl.org>
From: ph10 [...] hermes.cam.ac.uk
Date: Tue, 2 Oct 2018 17:10:45 +0100 (BST)
Subject: Re: [perl #133547] Inconsistency in Script Run
To: Karl Williamson <public [...] khwilliamson.com>
On Tue, 2 Oct 2018, Karl Williamson wrote:

> 1cf7 is not a Common character in the Script Extensions property, and so yes
> the longest script run in that string is AB. The Script property is
> irrelevant to script runs.

I must be misunderstanding something. I do not see the word "common"
anywhere in the ScriptExtensions.txt file. Ah! It *is* the script
extensions property for characters that are not mentioned whose Script
property is Common. Is that it? (This is proving to be much more
complicated than I expected. :-) Thanks for putting up with me.

Regards,
Philip

--
Philip Hazel
Date: Tue, 2 Oct 2018 15:39:56 -0600
Subject: Re: [perl #133547] Inconsistency in Script Run
To: ph10 [...] hermes.cam.ac.uk
From: Karl Williamson <public [...] khwilliamson.com>
CC: Abigail via RT <perlbug-followup [...] perl.org>
On 10/02/2018 10:10 AM, ph10@hermes.cam.ac.uk wrote:
> On Tue, 2 Oct 2018, Karl Williamson wrote:
>
>> 1cf7 is not a Common character in the Script Extensions property, and so yes
>> the longest script run in that string is AB. The Script property is
>> irrelevant to script runs.
>
> I must be misunderstanding something. I do not see the word "common"
> anywhere in the ScriptExtensions.txt file. Ah! It *is* the script
> extensions property for characters that are not mentioned whose Script
> property is Common. Is that it? (This is proving to be much more
> complicated than I expected. :-) Thanks for putting up with me.

The top of ScriptExtensions.txt says:

# All code points not explicitly listed for Script_Extensions
# have as their value the corresponding Script property value

The way mktables creates scx is to create a copy of sc, and then
override the entries that are in ScriptExtensions.txt.
RT-Send-CC: perl5-porters [...] perl.org
On Sun, 30 Sep 2018 09:51:12 -0700, khw wrote:
> This is fixed by
>
> commit 393e5a4585b92e635cfc4eee34da8f86f3bfd2af
> Author: Karl Williamson <khw@cpan.org>
> Date:   Sun Sep 30 10:38:02 2018 -0600
>

Karl, is there any chance you could prepare a patch for applying to
maint-5.28? It doesn't cherry-pick cleanly and I think you're probably
better placed than me to resolve the conflicts.
RT-Send-CC: perl5-porters [...] perl.org
I have now applied:

commit f4e61fc03836484ea88518e8bf04cc1b32a6a1a0
Author: Karl Williamson <khw@cpan.org>
Date:   Thu Mar 14 11:48:11 2019 -0600

    Any Common digit set can match in any script

    This fixes a design flaw in script runs that in 5.30 effectively
    prevented digits from the Common script except the ASCII [0-9] from
    being in any meaningful script run.

--
Karl Williamson
Subject: Re: [perl #133547] Inconsistency in Script Run
CC: perl5-porters [...] perl.org
Date: Thu, 14 Mar 2019 13:28:02 -0600
To: perlbug-followup [...] perl.org
From: Karl Williamson <public [...] khwilliamson.com>
On 1/9/19 11:16 AM, Steve Hay via RT wrote:
> On Sun, 30 Sep 2018 09:51:12 -0700, khw wrote:
>> This is fixed by
>>
>> commit 393e5a4585b92e635cfc4eee34da8f86f3bfd2af
>> Author: Karl Williamson <khw@cpan.org>
>> Date:   Sun Sep 30 10:38:02 2018 -0600
>>
>
> Karl, is there any chance you could prepare a patch for applying to
> maint-5.28? It doesn't cherry-pick cleanly and I think you're probably
> better placed than me to resolve the conflicts.

I didn't do this because of the design flaw in 5.30 this ticket showed.
That has now been fixed by f4e61fc03836484ea88518e8bf04cc1b32a6a1a0,
though I don't know whether it is suitable for backporting.

>
> ---
> via perlbug: queue: perl5 status: pending release
> https://rt.perl.org/Ticket/Display.html?id=133547
>

