Case folding of ß 00DF ß LATIN SMALL LETTER SHARP S #3352

Closed
p6rt opened this issue Mar 4, 2014 · 11 comments

p6rt commented Mar 4, 2014

Migrated from rt.perl.org#121377 (status was 'resolved')

Searchable as RT121377$

p6rt commented Jun 15, 2010

From @masak

<sorear> How does case insensitive matching work in perl 6?
<sorear> e.g. "ß" ~~ m:i/SS/
<masak> sorear: that's the syntax, so I assume you're asking about the semantics.
<masak> oh wait, that example is tricky :)
<masak> I would be surprised if Perl 6 is spec'd to handle that.
<sorear> yes. semantics, and dark corners thereof
<sorear> yes, S05 says exactly nothing on the subject
<sorear> other than "ignores case distinctions"
<moritz_> rakudo: say "ß" ~~ /:i SS/
<p6eval> rakudo cfbeb5: OUTPUT«␤»
<moritz_> rakudo: say uc "ß"
<p6eval> rakudo cfbeb5: OUTPUT«SS␤»
<masak> o.O
<masak> German is strange.
<moritz_> it sure is.
<moritz_> masak: want to submit a bug report about inconsistency?
* masak submits rakudobug
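
For readers comparing the two operations in the log, here is a minimal sketch in Python (not Rakudo), assuming only that Python's `str.upper` and `str.casefold` apply Unicode's full case mappings. The helper name `caseless_equal` is hypothetical. It shows why one would expect a case-insensitive match of "ß" against "SS" to agree with uppercasing:

```python
# Illustration in Python, not Rakudo: str.upper and str.casefold use
# Unicode's full (possibly multi-character) case mappings.
sharp_s = "\u00df"  # ß LATIN SMALL LETTER SHARP S

# Full uppercase mapping: one character becomes two.
upper = sharp_s.upper()

# Full case folding, the operation a caseless comparison is usually built on.
folded_sharp_s = sharp_s.casefold()
folded_ss = "SS".casefold()


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings by their full case foldings (hypothetical helper)."""
    return a.casefold() == b.casefold()


print(upper)                           # SS
print(folded_sharp_s, folded_ss)       # ss ss
print(caseless_equal(sharp_s, "SS"))   # True
```

Under full case folding both sides reduce to "ss", which is the consistent behavior the bug report asks for.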

p6rt commented Aug 11, 2010

@coke - Status changed from 'new' to 'open'

p6rt commented Oct 21, 2012

From @coke

Behavior changed:

"ß" ~~ m:i/SS/
#<failed match>
say uc "ß"
ß

Closable?

--
Will "Coke" Coleda

p6rt commented Oct 21, 2012

From @masak

Well, the *inconsistency* seems to be gone... but by pushing the
semantics in (what I consider to be) the wrong direction. I.e. now
instead of one of two things behaving the wrong way, both do.

p6rt commented Jul 19, 2013

From @ShimmerFairy

<lue> r: say "ß".uc
<camelia> rakudo 45d447: OUTPUT«ß␤»
<lue> r: say "ẞ".lc.uc
<camelia> rakudo 45d447: OUTPUT«SS␤»

Both examples above are meant to result in SS. Note that the capital
eszett does convert to a lowercase one:

<lue> r: say "ẞ".lc
<camelia> rakudo 45d447: OUTPUT«ß␤»

p6rt commented Mar 4, 2014

From @moritz

<moritz> p6: say 'ß'.uc, 'ß'.tc, 'ß'.tclc
<camelia> rakudo-jvm f2471a: OUTPUT«SSSSß␤»
<camelia> ..rakudo-parrot f2471a, rakudo-moar f2471a: OUTPUT«ßßß␤»
<camelia> ..niecza v24-109-g48a8de3: OUTPUT«ßSsSs␤»

All these answers are wrong. 'ß'.uc is supposed to be 'SS' or possibly
'ẞ', and 'ß'.tc and 'ß'.tclc should both be 'Ss'.

p6rt commented Mar 5, 2014

From @coke

Is this a unicode specified behavior (if so, can we have a URL for posterity?) or is this a native speaker response which contradicts unicode?

What's the desired behavior if ß is not at the beginning of the string?

There are already tests for this behavior in S32-str/{uc,tclc,tc}.t which might need to be cleaned up as a result of this test.
--
Will "Coke" Coleda

p6rt commented Mar 5, 2014

The RT System itself - Status changed from 'new' to 'open'

p6rt commented Mar 28, 2014

From @moritz

On Wed Mar 05 06:28:49 2014, coke wrote:

> Is this a unicode specified behavior (if so, can we have a URL for posterity?)

Yes. http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf refers to SpecialCasing.txt, and SpecialCasing.txt contains this:

# ==============================================================================
# Format
# ==============================================================================

# The entries in this file are in the following machine-readable format:
#
# <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
#
# <code>, <lower>, <title>, and <upper> provide character values in hex. If there is more
# than one character, they are separated by spaces. Other than as used to separate
# elements, spaces are to be ignored.

[...]
# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to titlecase(uppercase(<es-zed>))

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
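
To make the entry above easier to read, here is a minimal Python sketch that decodes it according to the quoted format comment. The function name `parse_special_casing` is hypothetical, and the sketch assumes an unconditional entry (no `<condition_list>`); it is not how any Perl 6 implementation loads this data.

```python
# The entry quoted above, copied verbatim from SpecialCasing.txt.
ENTRY = "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"


def parse_special_casing(line: str) -> dict:
    """Parse one unconditional SpecialCasing.txt entry (hypothetical helper).

    Field order is <code>; <lower>; <title>; <upper>, as stated in the
    format comment. Entries with a <condition_list> are not handled.
    """
    data, _, comment = line.partition("#")
    fields = [field.strip() for field in data.split(";")]
    code, lower, title, upper = fields[:4]

    def decode(hex_values: str) -> str:
        # Each field is a space-separated list of hex code points.
        return "".join(chr(int(h, 16)) for h in hex_values.split())

    return {
        "char": decode(code),
        "lower": decode(lower),
        "title": decode(title),
        "upper": decode(upper),
        "name": comment.strip(),
    }


mapping = parse_special_casing(ENTRY)
print(mapping)
```

Decoded, the entry says that ß lowercases to itself, titlecases to "Ss" and uppercases to "SS".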

> or is this a native speaker response which contradicts unicode?

It's in line with both the expectations of one native speaker (me) and with Unicode.

> What's the desired behavior if ß is not at the beginning of the string?

.uc should fold it to 'SS' regardless, and .tclc and .lc should leave it alone.

> There are already tests for this behavior in S32-str/{uc,tclc,tc}.t which might need to be cleaned up as a result of this test.

So far the tests that I've seen seem to agree with me, but I haven't looked at all of them yet.
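
The non-initial case described in this reply can be sketched in Python, whose `str.upper`, `str.lower` and `str.title` apply the same Unicode full case mappings. This is only an illustration of the expected results, assuming a single-word string; it is not the Rakudo test suite, and `str.title` is used here as a rough stand-in for `.tclc`.

```python
word = "stra\u00dfe"  # "straße": ß in a non-initial position

upper = word.upper()   # uppercasing expands ß to SS wherever it occurs
lower = word.lower()   # ß is already lowercase, so nothing changes
titled = word.title()  # only the first letter is titlecased; ß is left alone

# When ß is the first letter, titlecasing maps it to "Ss".
leading = "\u00df".title()

print(upper)    # STRASSE
print(lower)    # straße
print(titled)   # Straße
print(leading)  # Ss
```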

p6rt commented Oct 9, 2015

From @jnthn

This is now implemented, and tests unfudged.

@p6rt p6rt closed this as completed Oct 9, 2015

p6rt commented Oct 9, 2015

@jnthn - Status changed from 'open' to 'resolved'
