Inconsistency in Script Run #16704

Closed
p5pRT opened this issue Sep 27, 2018 · 20 comments

p5pRT commented Sep 27, 2018

Migrated from rt.perl.org#133547 (status was 'resolved')

Searchable as RT133547$

p5pRT commented Sep 27, 2018

From ph10@hermes.cam.ac.uk

Created by ph10@cam.ac.uk

I was running some tests on the new (*script_run:...) regex feature,
preparatory to implementing it in PCRE. As I understand it from reading perlre,
the ASCII digits 0-9 should be acceptable in any script run, provided there
aren't any other digits. There seems to be some inconsistency. Consider these
two examples:

$ perl -e 'if ("\x{3041}12\x{3041}" =~ /^(*sr:.{4})/) { print "yes >$&<\n"; } else { print "no \n"; }'
yes >ぁ12ぁ<

In this example, the two ASCII digits "12" are flanked by two Hiragana
characters; the pattern matches. This is also true for many other scripts,
including Greek, Cyrillic, Armenian, Hebrew, Arabic, Ethiopic, and Ogham.

$ perl -e 'if ("\x{0980}12\x{0993}" =~ /^(*sr:.{4})/) { print "yes >$&<\n"; } else { print "no \n"; }'
no

In this example, the two ASCII digits "12" are flanked by two Bengali
characters; the pattern does not match. This is also true for Thaana, Thai,
Khmer and Devanagari.

Why the difference? I haven't exhaustively tested all possible scripts, and I
haven't spotted any pattern in which ones match and which ones don't.
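
A minimal sketch for repeating this comparison across several scripts in one go, assuming Perl 5.28 or later. The helper name `is_script_run` and the choice of sample code points are illustrative, not part of the original report, and the yes/no results depend on the Perl version being tested.

```perl
use 5.028;
use warnings;
no warnings 'experimental::script_run';

# Return 1 if the whole string forms a single script run, else 0.
sub is_script_run {
    my ($str) = @_;
    return $str =~ /\A(*sr:.+)\z/s ? 1 : 0;
}

# One sample letter per script (hypothetical selection for illustration).
my @samples = (
    [ Latin      => "\x{0041}" ],
    [ Greek      => "\x{03B1}" ],
    [ Cyrillic   => "\x{0430}" ],
    [ Hebrew     => "\x{05D0}" ],
    [ Hiragana   => "\x{3041}" ],
    [ Bengali    => "\x{0993}" ],
    [ Devanagari => "\x{0915}" ],
    [ Thai       => "\x{0E01}" ],
);

for my $sample (@samples) {
    my ($name, $ch) = @$sample;
    # Two ASCII digits flanked by two letters of the script under test.
    my $str = $ch . "12" . $ch;
    printf "%-10s %s\n", $name, is_script_run($str) ? "yes" : "no";
}
```

Scripts that print "no" here are the ones affected by the inconsistency described above.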

Philip

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.28.0:

Configured by builduser at Wed Aug  1 10:43:08 CEST 2018.

Summary of my perl5 (revision 5 version 28 subversion 0) configuration:
   
  Platform:
    osname=linux
    osvers=4.17.11-arch1
    archname=x86_64-linux-thread-multi
    uname='linux flo-64s 4.17.11-arch1 #1 smp preempt sun jul 29 10:11:16 utc 2018 x86_64 gnulinux '
    config_args='-des -Dusethreads -Duseshrplib -Doptimize=-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt -Dprefix=/usr -Dvendorprefix=/usr -Dprivlib=/usr/share/perl5/core_perl -Darchlib=/usr/lib/perl5/5.28/core_perl -Dsitelib=/usr/share/perl5/site_perl -Dsitearch=/usr/lib/perl5/5.28/site_perl -Dvendorlib=/usr/share/perl5/vendor_perl -Dvendorarch=/usr/lib/perl5/5.28/vendor_perl -Dscriptdir=/usr/bin/core_perl -Dsitescript=/usr/bin/site_perl -Dvendorscript=/usr/bin/vendor_perl -Dinc_version_list=none -Dman1ext=1perl -Dman3ext=3perl -Dcccdlflags='-fPIC' -Dlddlflags=-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -Dldflags=-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=define
    usemultiplicity=define
    use64bitint=define
    use64bitall=define
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
    bincompat5005=undef
  Compiler:
    cc='cc'
    ccflags ='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    optimize='-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt'
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='8.1.1 20180531'
    gccosandvers=''
    intsize=4
    longsize=8
    ptrsize=8
    doublesize=8
    byteorder=12345678
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=16
    longdblkind=3
    ivtype='long'
    ivsize=8
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=8
    prototype=define
  Linker and Libraries:
    ld='cc'
    ldflags ='-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/x86_64-pc-linux-gnu/8.1.1/include-fixed /usr/lib /lib/../lib /usr/lib/../lib /lib /lib64 /usr/lib64
    libs=-lpthread -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.27.so
    so=so
    useshrplib=true
    libperl=libperl.so
    gnulibc_version='2.27'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.28/core_perl/CORE'
    cccdlflags='-fPIC'
    lddlflags='-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -L/usr/local/lib -fstack-protector-strong'



@INC for perl 5.28.0:
    /usr/lib/perl5/5.28/site_perl
    /usr/share/perl5/site_perl
    /usr/lib/perl5/5.28/vendor_perl
    /usr/share/perl5/vendor_perl
    /usr/lib/perl5/5.28/core_perl
    /usr/share/perl5/core_perl


Environment for perl 5.28.0:
    HOME=/home/ph10
    LANG=en_GB.utf8
    LANGUAGE=en_GB.utf8
    LC_ALL=C
    LC_COLLATE=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/ph10/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/android-sdk/platform-tools:/opt/android-sdk/tools:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/sbin:.:/opt/android-sdk/platform-tools:/opt/android-sdk/tools:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl
    PERL_BADLANG (unset)
    SHELL=/bin/bash

p5pRT commented Sep 27, 2018

From @Abigail

Can you check with blead? I reported this in August, and Karl fixed
that the same day. So 5.28.0 is broken, but blead should do things
correctly.

Regards,

Abigail

p5pRT commented Sep 27, 2018

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Sep 28, 2018

From ph10@hermes.cam.ac.uk

On Thu, 27 Sep 2018, Abigail via RT wrote:

> Can you check with blead?

Not without some research and learning how to do that. :-) But if I get
some time I'll have a go.

> I reported this in August, and Karl fixed that the same day. So 5.28.0
> is broken, but blead should do things correctly.

I'm pleased to learn that it *is* a bug, and not some misunderstanding
on my part. Thanks for the fast response.

Regards,
Philip

--
Philip Hazel

p5pRT commented Sep 30, 2018

From @khwilliamson

This is fixed by

commit 393e5a4
Author: Karl Williamson <khw@cpan.org>
Date: Sun Sep 30 10:38:02 2018 -0600

  PATCH: [perl #133547]: script run broken
 
  All scripts can have the ASCII digits for their numbers. Scripts with
  their own digits can alternatively use those. Only one of these two
  sets can be used in a script run. The decision as to which set to use
  must be deferred until the first digit is encountered, as otherwise we
  don't know which set will be used. Prior to this commit, the decision
  was being made prematurely in some cases. As a result of this change,
  the non-ASCII-digits in the Common script need to be special-cased, and
  different criteria are used to decide if we need to look up whether a
  character is a digit or not.
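
The digit rule in this commit message can be illustrated with a short sketch. It is not Perl's actual implementation, only a model of the stated rule: the digit set stays undecided until the first digit is seen, and every later digit must come from that same set of ten. The function name `same_digit_set` is hypothetical, and the sketch checks only the digit constraint, not the script constraint.

```perl
use 5.010;
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Return 1 if every decimal digit in $str comes from the same set of ten
# consecutive digits (for example all ASCII, or all fullwidth), else 0.
sub same_digit_set {
    my ($str) = @_;
    my $zero;    # code point of the set's zero; undecided until the first digit
    for my $ch (split //, $str) {
        next unless $ch =~ /\p{Nd}/;
        my $cp        = ord $ch;
        # Each Unicode decimal digit set is a contiguous run of ten code
        # points, so subtracting the digit's value gives the set's zero.
        my $this_zero = $cp - charinfo($cp)->{decimal};
        $zero //= $this_zero;             # first digit fixes the set
        return 0 if $this_zero != $zero;  # a later digit from another set
    }
    return 1;
}

say same_digit_set("A12B")             ? "consistent" : "mixed";
say same_digit_set("1\x{ff12}")        ? "consistent" : "mixed";
say same_digit_set("\x{ff10}\x{ff19}") ? "consistent" : "mixed";
```

Deferring the assignment to `$zero` until a digit actually appears is the point of the fix: letters seen earlier in the run say nothing about which digit set the run will use.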

--
Karl Williamson

p5pRT commented Sep 30, 2018

@khwilliamson - Status changed from 'open' to 'pending release'

p5pRT commented Sep 30, 2018

From @khwilliamson

On 09/28/2018 01:33 AM, ph10@hermes.cam.ac.uk wrote:

> On Thu, 27 Sep 2018, Abigail via RT wrote:
>
>> Can you check with blead?
>
> Not without some research and learning how to do that. :-) But if I get
> some time I'll have a go.
>
>> I reported this in August, and Karl fixed that the same day. So 5.28.0
>> is broken, but blead should do things correctly.
>
> I'm pleased to learn that it *is* a bug, and not some misunderstanding
> on my part. Thanks for the fast response.
>
> Regards,
> Philip

The fix for this should be put in 5.28.1.

perlre has been updated since 5.28.0 to make clearer the acceptable
behavior of a run. Hopefully, if you had read it, you wouldn't have
thought it was a misunderstanding:

https://perl5.git.perl.org/perl.git/commitdiff/4a1d964056983f26f5646fdb7aadb4b5e7b5235f

p5pRT commented Oct 1, 2018

From ph10@hermes.cam.ac.uk

On Sun, 30 Sep 2018, Karl Williamson wrote:

> perlre has been updated since 5.28.0 to make clearer the acceptable behavior
> of a run. Hopefully, if you had read it, you wouldn't have thought it was a
> misunderstanding:
>
> https://perl5.git.perl.org/perl.git/commitdiff/4a1d964056983f26f5646fdb7aadb4b5e7b5235f

Many thanks, Karl. That confirms what I had (finally :-) deduced myself,
and is admirably clear.

Regards,
Philip

--
Philip Hazel

p5pRT commented Oct 2, 2018

From ph10@hermes.cam.ac.uk

On Sun, 30 Sep 2018, Karl Williamson wrote:

> The fix for this should be put in 5.28.1.

I have downloaded v5.29.4 (v5.29.3-35-g4288c5b93b) and can confirm that
all the issues I previously reported are fixed. However, there are still
two oddities that don't seem to be right. The digit sequences FF10..FF19
and 1D7CE..1D7FF (both in the Common script) don't seem to work as I
expected them to. A string containing them along with Latin characters is
not valid as a script run in this testing Perl. Indeed, a string with
only one of them and Latin characters doesn't match (which it surely
should, regardless of being a digit, since it is in the Common script).
Two of them on their own, without any Latin characters, do match.

These strings match the pattern /^(*sr:.{4})/

  \x{ff10}\x{ff19}..
  \x{1d7ce}\x{1d7cf},,
 
These don't:

  A\x{ff10}\x{ff19}B
  A\x{ff10}BC
  A\x{1d7ce}\x{1d7cf}B
  A\x{1d7ce}BC
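
A minimal sketch for re-running these six cases on whatever Perl is at hand, assuming Perl 5.28 or later. The helper names `show` and `matches_sr4` are hypothetical, and the results depend on the Perl version under test.

```perl
use 5.028;
use warnings;
no warnings 'experimental::script_run';

# Render non-ASCII characters as \x{...} escapes so the output is readable
# on any terminal.
sub show {
    my ($str) = @_;
    return join '', map {
        my $cp = ord;
        $cp > 0x7e ? sprintf('\x{%x}', $cp) : $_;
    } split //, $str;
}

# Same pattern as above: the first four characters must form a script run.
sub matches_sr4 {
    my ($str) = @_;
    return $str =~ /^(*sr:.{4})/ ? 1 : 0;
}

my @cases = (
    "\x{ff10}\x{ff19}..",
    "\x{1d7ce}\x{1d7cf},,",
    "A\x{ff10}\x{ff19}B",
    "A\x{ff10}BC",
    "A\x{1d7ce}\x{1d7cf}B",
    "A\x{1d7ce}BC",
);

printf "%-24s %s\n", show($_), matches_sr4($_) ? "match" : "no match"
    for @cases;
```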

Regards,
Philip

--
Philip Hazel

p5pRT commented Oct 2, 2018

From @khwilliamson

On 10/02/2018 03:57 AM, ph10@hermes.cam.ac.uk wrote:

> On Sun, 30 Sep 2018, Karl Williamson wrote:
>
>> The fix for this should be put in 5.28.1.
>
> I have downloaded v5.29.4 (v5.29.3-35-g4288c5b93b) and can confirm that
> all the issues I previously reported are fixed. However, there are still
> two oddities that don't seem to be right. The digit sequences FF10..FF19
> and 1D7CE..1D7FF (both in the Common script) don't seem to work as I
> expected them to. A string containing them along with Latin characters is
> not valid as a script run in this testing Perl. Indeed, a string with
> only one of them and Latin characters doesn't match (which it surely
> should, regardless of being a digit, since it is in the Common script).
> Two of them on their own, without any Latin characters, do match.
>
> These strings match the pattern /^(*sr:.{4})/
>
>   \x{ff10}\x{ff19}..
>   \x{1d7ce}\x{1d7cf},,
>
> These don't:
>
>   A\x{ff10}\x{ff19}B
>   A\x{ff10}BC
>   A\x{1d7ce}\x{1d7cf}B
>   A\x{1d7ce}BC

Technically, this isn't a bug, but a design flaw.

My design allowed only the ASCII digits 0-9 to be used with other scripts.
Your second batch of cases here are in the Latin script, and therefore
the only digits from Common that are allowed are the ASCII ones.

But that is not what a reasonable person would expect, and so the design
is wrong.

I see two choices:

1) Allow the non-ASCII digits that are considered Common to match the
Latin script

2) Allow these to match any script, just like the ASCII ones already do.

The second solution seems more in keeping with Unicode's intent: since
they made these digits Common, they should be allowed in multiple scripts.
But the requirement that all digits in a run must come from the same
sequence of 10 would remain.

I'm open to hearing arguments either way, or some third way.

p5pRT commented Oct 2, 2018

From ph10@hermes.cam.ac.uk

On Sun, 30 Sep 2018, Karl Williamson wrote:

> perlre has been updated since 5.28.0 to make clearer the acceptable behavior
> of a run.

Sorry to nag you again, but have I got the following right? Perl allows
a Common or Inherited character in a script run only if its Script
Extension property lists the script of other characters in the run, or
if it doesn't figure in the Extensions file. Example​: the longest script
run in "AB\x{1cf7}" is "AB", even though 1cf7 is a Common character.

Regards,
Philip

--
Philip Hazel

p5pRT commented Oct 2, 2018

From ph10@hermes.cam.ac.uk

On Tue, 2 Oct 2018, Karl Williamson wrote:

> Technically, this isn't a bug, but a design flaw.

Nice distinction! :-)

> 2) Allow these to match any script, just like the ASCII ones already do.

That is what I expected, and what I have tentatively implemented.

> The second solution seems more in keeping with Unicode's intent, since they
> made these digits Common, so should be allowed in multiple scripts. But the
> requirement that all digits in a run must come from the same sequence of 10
> would remain.

Yes, indeed.

Regards,
Philip

--
Philip Hazel

p5pRT commented Oct 2, 2018

From @khwilliamson

On 10/02/2018 09:12 AM, ph10@hermes.cam.ac.uk wrote:

> On Sun, 30 Sep 2018, Karl Williamson wrote:
>
>> perlre has been updated since 5.28.0 to make clearer the acceptable behavior
>> of a run.
>
> Sorry to nag you again, but have I got the following right? Perl allows
> a Common or Inherited character in a script run only if its Script
> Extension property lists the script of other characters in the run, or
> if it doesn't figure in the Extensions file. Example: the longest script
> run in "AB\x{1cf7}" is "AB", even though 1cf7 is a Common character.

1cf7 is not a Common character in the Script Extensions property, and so
yes the longest script run in that string is AB. The Script property is
irrelevant to script runs.

p5pRT commented Oct 2, 2018

From ph10@hermes.cam.ac.uk

On Tue, 2 Oct 2018, Karl Williamson wrote:

> 1cf7 is not a Common character in the Script Extensions property, and so yes
> the longest script run in that string is AB. The Script property is
> irrelevant to script runs.

I must be misunderstanding something. I do not see the word "common"
anywhere in the ScriptExtensions.txt file. Ah! It *is* the script
extensions property for characters that are not mentioned whose Script
property is Common. Is that it? (This is proving to be much more
complicated than I expected. :-) Thanks for putting up with me.

Regards,
Philip

--
Philip Hazel

p5pRT commented Oct 2, 2018

From @khwilliamson

On 10/02/2018 10:10 AM, ph10@hermes.cam.ac.uk wrote:

> On Tue, 2 Oct 2018, Karl Williamson wrote:
>
>> 1cf7 is not a Common character in the Script Extensions property, and so yes
>> the longest script run in that string is AB. The Script property is
>> irrelevant to script runs.
>
> I must be misunderstanding something. I do not see the word "common"
> anywhere in the ScriptExtensions.txt file. Ah! It *is* the script
> extensions property for characters that are not mentioned whose Script
> property is Common. Is that it? (This is proving to be much more
> complicated than I expected. :-) Thanks for putting up with me.

The top of ScriptExtensions.txt says​:

# All code points not explicitly listed for Script_Extensions
# have as their value the corresponding Script property value

The way mktables creates scx is to create a copy of sc, and then
override the entries that are in ScriptExtensions.txt.
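
The difference between the two properties can be seen directly with Perl's `\p{...}` matching. This is a small sketch, assuming a Perl whose Unicode data includes U+1CF7 (Unicode 10.0, so Perl 5.28 or later); the helper name `script_facts` is hypothetical.

```perl
use 5.028;
use warnings;

# Report how a code point is classified by Script (sc) and by
# Script_Extensions (scx). The compound \p{Name=Value} form selects
# the property explicitly.
sub script_facts {
    my ($cp) = @_;
    my $ch = chr $cp;
    return +{
        sc_common   => ($ch =~ /\p{Script=Common}/             ? 1 : 0),
        scx_common  => ($ch =~ /\p{Script_Extensions=Common}/  ? 1 : 0),
        scx_bengali => ($ch =~ /\p{Script_Extensions=Bengali}/ ? 1 : 0),
        scx_latin   => ($ch =~ /\p{Script_Extensions=Latin}/   ? 1 : 0),
    };
}

for my $cp (0x41, 0x1CF7) {
    my $f = script_facts($cp);
    printf "U+%04X sc=Common:%d scx=Common:%d scx=Bengali:%d scx=Latin:%d\n",
        $cp, @{$f}{qw(sc_common scx_common scx_bengali scx_latin)};
}
```

U+1CF7 is listed in ScriptExtensions.txt, so its scx value overrides the Common value it would otherwise inherit from sc; "A" is not listed, so its scx value is simply its sc value, Latin.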

p5pRT commented Jan 9, 2019

From @steve-m-hay

On Sun, 30 Sep 2018 09:51:12 -0700, khw wrote:

> This is fixed by
>
> commit 393e5a4
> Author: Karl Williamson <khw@cpan.org>
> Date: Sun Sep 30 10:38:02 2018 -0600

Karl, is there any chance you could prepare a patch for applying to maint-5.28? It doesn't cherry-pick cleanly and I think you're probably better placed than me to resolve the conflicts.

p5pRT commented Mar 14, 2019

From @khwilliamson

I have now applied:
commit f4e61fc
Author: Karl Williamson <khw@cpan.org>
Date: Thu Mar 14 11:48:11 2019 -0600

  Any Common digit set can match in any script
 
  This fixes a design flaw in script runs that in 5.30 effectively
  prevented digits from the Common script except the ASCII [0-9] from
  being in any meaningful script run.
--
Karl Williamson

p5pRT commented Mar 14, 2019

From @khwilliamson

On 1/9/19 11:16 AM, Steve Hay via RT wrote:

> On Sun, 30 Sep 2018 09:51:12 -0700, khw wrote:
>
>> This is fixed by
>>
>> commit 393e5a4
>> Author: Karl Williamson <khw@cpan.org>
>> Date: Sun Sep 30 10:38:02 2018 -0600
>
> Karl, is there any chance you could prepare a patch for applying to maint-5.28? It doesn't cherry-pick cleanly and I think you're probably better placed than me to resolve the conflicts.

I didn't do this because of the design flaw in 5.30 that this ticket showed.
That has now been fixed by f4e61fc, though I don't know whether it is
suitable for backporting.

---
via perlbug​: queue​: perl5 status​: pending release
https://rt-archive.perl.org/perl5/Ticket/Display.html?id=133547

p5pRT commented May 22, 2019

From @khwilliamson

Thank you for filing this report. You have helped make Perl better.

With the release today of Perl 5.30.0, this and 160 other issues have been
resolved.

Perl 5.30.0 may be downloaded via​:
https://metacpan.org/release/XSAWYERX/perl-5.30.0

If you find that the problem persists, feel free to reopen this ticket.

p5pRT commented May 22, 2019

@khwilliamson - Status changed from 'pending release' to 'resolved'
