Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with thousands separators in perllocale #16686

Open
p5pRT opened this issue Sep 8, 2018 · 11 comments
Open

Issues with thousands separators in perllocale #16686

p5pRT opened this issue Sep 8, 2018 · 11 comments

Comments

@p5pRT
Copy link

p5pRT commented Sep 8, 2018

Migrated from rt.perl.org#133504 (status was 'open')

Searchable as RT133504$

@p5pRT
Copy link
Author

p5pRT commented Sep 8, 2018

From rdiezmail-perl@yahoo.de

Adding thousands separators to a number is a rather common operation. Most languages use a printf format string like "%'d", but Perl does not support it. I suggest that you add this feature and be done with it.

Otherwise, searching the Internet will throw up a few alternatives with different degrees of trouble. The following official Perl documentation page has example code for that task​:

https://perldoc.perl.org/perllocale.html

Search for "Format command line params for current locale" in that page to find the example code I am talking about.

My first impression when looking at that example was​: A "1 while" loop with a cryptic regex inside... what an alarming Perl nerdiness sign...

A traditional loop without regular expressions is easier to understand. Search for "This is for manually testing of AddThousandsSeparators()" in the following Perl script I recently wrote for an alternative implementation.

https://github.com/rdiez/Tools/blob/master/AnnotateWithTimestamps/AnnotateWithTimestamps.pl

I have done a quick benchmark, and my loop version is much faster, at least with Perl v5.22.1.

One bad practice that stands out in the perllocale example is the usage of 2 variables with the same name but with different types, namely $grouping and @​grouping. Furthermore, I would not encourage implicit usage of special variable $_ anymore. Using named variables makes the code easier to grasp.

But more seriously, the perllocale example code does not work correctly with the Spanish locale, where the thousands separator is the period ('.') character, probably because that character has a special meaning in regular expressions. I guess some quoting is necessary.

@p5pRT
Copy link
Author

p5pRT commented Sep 8, 2018

From @jkeenan

On Sat, 08 Sep 2018 11​:00​:00 GMT, rdiezmail-perl@​yahoo.de wrote​:

Adding thousands separators to a number is a rather common operation.
Most languages use a printf format string like "%'d", but Perl does
not support it. I suggest that you add this feature and be done with
it.

Otherwise, searching the Internet will throw up a few alternatives
with different degrees of trouble. The following official Perl
documentation page has example code for that task​:

https://perldoc.perl.org/perllocale.html

Search for "Format command line params for current locale" in that
page to find the example code I am talking about.

My first impression when looking at that example was​: A "1 while" loop
with a cryptic regex inside... what an alarming Perl nerdiness sign...

A traditional loop without regular expressions is easier to
understand. Search for "This is for manually testing of
AddThousandsSeparators()" in the following Perl script I recently
wrote for an alternative implementation.

https://github.com/rdiez/Tools/blob/master/AnnotateWithTimestamps/AnnotateWithTimestamps.pl

I have done a quick benchmark, and my loop version is much faster, at
least with Perl v5.22.1.

One bad practice that stands out in the perllocale example is the
usage of 2 variables with the same name but with different types,
namely $grouping and @​grouping. Furthermore, I would not encourage
implicit usage of special variable $_ anymore. Using named variables
makes the code easier to grasp.

But more seriously, the perllocale example code does not work
correctly with the Spanish locale, where the thousands separator is
the period ('.') character, probably because that character has a
special meaning in regular expressions. I guess some quoting is
necessary.

There are many locales which in some sense are "Spanish". For example, on Ubuntu​:

#####
$ grep ^es_ /usr/share/i18n/SUPPORTED
es_AR.UTF-8 UTF-8
es_AR ISO-8859-1
es_BO.UTF-8 UTF-8
es_BO ISO-8859-1
es_CL.UTF-8 UTF-8
es_CL ISO-8859-1
es_CO.UTF-8 UTF-8
es_CO ISO-8859-1
es_CR.UTF-8 UTF-8
es_CR ISO-8859-1
es_CU UTF-8
es_DO.UTF-8 UTF-8
es_DO ISO-8859-1
es_EC.UTF-8 UTF-8
es_EC ISO-8859-1
es_ES.UTF-8 UTF-8
es_ES ISO-8859-1
es_ES@​euro ISO-8859-15
es_GT.UTF-8 UTF-8
es_GT ISO-8859-1
es_HN.UTF-8 UTF-8
es_HN ISO-8859-1
es_MX.UTF-8 UTF-8
es_MX ISO-8859-1
es_NI.UTF-8 UTF-8
es_NI ISO-8859-1
es_PA.UTF-8 UTF-8
es_PA ISO-8859-1
es_PE.UTF-8 UTF-8
es_PE ISO-8859-1
es_PR.UTF-8 UTF-8
es_PR ISO-8859-1
es_PY.UTF-8 UTF-8
es_PY ISO-8859-1
es_SV.UTF-8 UTF-8
es_SV ISO-8859-1
es_US.UTF-8 UTF-8
es_US ISO-8859-1
es_UY.UTF-8 UTF-8
es_UY ISO-8859-1
es_VE.UTF-8 UTF-8
es_VE ISO-8859-1
#####

Can you be more specific?

Thank you very much.

--
James E Keenan (jkeenan@​cpan.org)

@p5pRT
Copy link
Author

p5pRT commented Sep 8, 2018

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Author

p5pRT commented Sep 8, 2018

From @jkeenan

On Sat, 08 Sep 2018 11​:00​:00 GMT, rdiezmail-perl@​yahoo.de wrote​:

Adding thousands separators to a number is a rather common operation.
Most languages use a printf format string like "%'d", but Perl does
not support it. I suggest that you add this feature and be done with
it.

Otherwise, searching the Internet will throw up a few alternatives
with different degrees of trouble. The following official Perl
documentation page has example code for that task​:

https://perldoc.perl.org/perllocale.html

Search for "Format command line params for current locale" in that
page to find the example code I am talking about.

My first impression when looking at that example was​: A "1 while" loop
with a cryptic regex inside... what an alarming Perl nerdiness sign...

A traditional loop without regular expressions is easier to
understand. Search for "This is for manually testing of
AddThousandsSeparators()" in the following Perl script I recently
wrote for an alternative implementation.

https://github.com/rdiez/Tools/blob/master/AnnotateWithTimestamps/AnnotateWithTimestamps.pl

I have done a quick benchmark, and my loop version is much faster, at
least with Perl v5.22.1.

One bad practice that stands out in the perllocale example is the
usage of 2 variables with the same name but with different types,
namely $grouping and @​grouping. Furthermore, I would not encourage
implicit usage of special variable $_ anymore. Using named variables
makes the code easier to grasp.

But more seriously, the perllocale example code does not work
correctly with the Spanish locale, where the thousands separator is
the period ('.') character, probably because that character has a
special meaning in regular expressions. I guess some quoting is
necessary.

OP and list​: Please review patch attached.

Thank you very much.
--
James E Keenan (jkeenan@​cpan.org)

@p5pRT
Copy link
Author

p5pRT commented Sep 8, 2018

From @jkeenan

0001-Improve-example-using-thousands-separators.patch
From f5412a7716568219caf56c8019bedb9498598ea1 Mon Sep 17 00:00:00 2001
From: James E Keenan <jkeenan@cpan.org>
Date: Sat, 8 Sep 2018 11:11:36 -0400
Subject: [PATCH] Improve example using thousands separators.

For: RT #133504
---
 pod/perllocale.pod | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/pod/perllocale.pod b/pod/perllocale.pod
index 207aea0916..5ee6f0c30e 100644
--- a/pod/perllocale.pod
+++ b/pod/perllocale.pod
@@ -748,6 +748,11 @@ Here's a simple-minded example program that rewrites its command-line
 parameters as integers correctly formatted in the current locale:
 
     use POSIX qw(locale_h);
+    use locale qw(:numeric);
+
+    my $old_locale;
+    $old_locale = setlocale(LC_NUMERIC);
+    setlocale(LC_NUMERIC, "es_ES.utf8");
 
     # Get some of locale's numeric formatting parameters
     my ($thousands_sep, $grouping) =
@@ -767,20 +772,24 @@ parameters as integers correctly formatted in the current locale:
     # right to left (low to high digits).  In the
     # below we cheat slightly by never using anything
     # else than the first grouping (whatever that is).
+
+    my @grouping_list;
     if ($grouping) {
-        @grouping = unpack("C*", $grouping);
+        @grouping_list = unpack("C*", $grouping);
     } else {
-        @grouping = (3);
+        @grouping_list = (3);
     }
 
     # Format command line params for current locale
-    for (@ARGV) {
-        $_ = int;    # Chop non-integer part
+    for my $input (@ARGV) {
+        my $i = int($input);
         1 while
-        s/(\d)(\d{$grouping[0]}($|$thousands_sep))/$1$thousands_sep$2/;
-        print "$_";
+            $i =~ s<(\d)(\d{$grouping_list[0]}($|\Q$thousands_sep\E))>
+                   <$1$thousands_sep$2>x;
+        print "$i\n";
     }
-    print "\n";
+
+    setlocale(LC_NUMERIC, $old_locale);
 
 Note that if the platform doesn't have C<LC_NUMERIC> and/or
 C<LC_MONETARY> available or enabled, the corresponding elements of the
-- 
2.17.1

@p5pRT
Copy link
Author

p5pRT commented Sep 10, 2018

From @Smylers

James E Keenan via RT writes​:

On Sat, 08 Sep 2018 11​:00​:00 GMT, rdiezmail-perl@​yahoo.de wrote​:

I would not encourage implicit usage of special variable $_ anymore.
Using named variables makes the code easier to grasp.

OP and list​: Please review patch attached.

I don't accept that $_ shouldn't be used “any more”, nor that using
named variables instead necessarily makes code easier to read. Note, I'm
not trying to persuade the original poster of the opposite; it's a
matter of style, as such it isn't something that Perl 5 Porters needs to
have a collective position on.

This example happens to be written using $_ as the loop variable, to
refer to the item currently being processed — perfectly idiomatic Perl.
Changing that to use $i instead (which as variables go isn't really
‘named’ anyway) is pointless churn, and sets an awkward precedent for
individuals to submit patches tweaking doc examples to their personal
tastes.

(If it'd already been using $i, I don't think changing it to $_ would've
been worth it either.)

One bad practice that stands out in the perllocale example is the
usage of 2 variables with the same name but with different types,
namely $grouping and @​grouping.

That's a good point. $grouping is a list of integers in a string, and
@​grouping is the same list expanded into an array. Unfortunately I'm not
sure that changing the second to @​grouping_list really helps matters​:
pretty much the entire point of an arrays is to hold a list, so calling
an array @​foo_list is tautologically using the suffix _list in addition
to the prefix @​. And anyway, the scalar variable in this case is also
holding a list.

If you're going to improve this, I think better options would be either
or both of​:

• Instead of saving a hash slice of the return value of localeconv into
  two separate variables, just save all the properties to a hash-ref​:

  my $number_prop = localconv();

  then use $number_prop->{grouping} in place of $grouping.

• Since only the first element of @​grouping is ever used, don't bother
  with the array; only store the first element, in a scalar. Like​:

  my $group_size =
  $number_prop->{grouping} ? unpack 'C', $number_prop->{grouping}
  : 3;

  Then use $group_size in place of $grouping[0].

Using $grouping and @​grouping_list meets the ‘don't use the same name
for different variables’ criterion on a technicality, without really
being clear to readers what they each are.

(Obviously I'm happy with making this work for Spanish.)

Smylers

@p5pRT
Copy link
Author

p5pRT commented Sep 10, 2018

From @Grinnz

On Mon, Sep 10, 2018 at 5​:01 AM Smylers <smylers@​stripey.com> wrote​:

I don't accept that $_ shouldn't be used “any more”, nor that using
named variables instead necessarily makes code easier to read. Note, I'm
not trying to persuade the original poster of the opposite; it's a
matter of style, as such it isn't something that Perl 5 Porters needs to
have a collective position on.

This example happens to be written using $_ as the loop variable, to
refer to the item currently being processed — perfectly idiomatic Perl.
Changing that to use $i instead (which as variables go isn't really
‘named’ anyway) is pointless churn, and sets an awkward precedent for
individuals to submit patches tweaking doc examples to their personal
tastes.

(If it'd already been using $i, I don't think changing it to $_ would've
been worth it either.)

I disagree. In addition to named variables providing better opportunities
for identification (you could name it $line or $entry or whatever is
appropriate if you don't find $i descriptive enough) they also prevent
possibility of bugs stemming from overuse of the $_ global variable. For
instance, if you call a subroutine that modifies $_ and doesn't localize
it, there's nothing you can do about it, but there's no way for outside
code to modify a loop lexical. And the situation is more common than you
think, due to "while (<$fh>)" not localizing $_.

-Dan

@p5pRT
Copy link
Author

p5pRT commented Sep 11, 2018

From rdiezmail-perl@yahoo.de

In my opinion, this is the wrong way to fix this issue in the official Perl documentation.

I would just implement the quasi-standard way with printf.

The new code still encourages using special variable $_ , which I would not do anymore.

Say you use your example code to write a subroutine. Is \Q...\E safe? What if the user manages to inject an \E from the outside? If you are going to quote for a regular expression, do it right with a separate variable and quotemeta().

The loop with a regular expression is hard to understand, for no gain. In fact, the implementation with a regular expression is very inefficient compared with a standard loop that indexes on a string.

In fact, for an official FAQ, I would implement a full loop that uses the whole @​grouping_list array, not just the first element. Perl always wanted to have a full locale implementation, didn't it?

Actually, half of my comments above are repetitions of my first post. If you have a different opinion, please state it. I took my time writing all that, and I find it rude just being ignored.

Best regards,
  rdiez

@p5pRT
Copy link
Author

p5pRT commented Sep 11, 2018

From rdiezmail-perl@yahoo.de

There are many locales which in some sense are "Spanish".
[...]
My system is configured like this​:

declare -x LANG="es_ES.UTF-8"
declare -x LANGUAGE="es_ES"

But you do not need any locale to test the bug, just set $thousands_sep to '.' before using the regular expression in the perllocale example code.

The code should be robust, no matter what the locale specifics are. You do not want your example code to break if a future locale uses some character which has a special meaning in a regular expression.

Best regards,
  rdiez

@p5pRT
Copy link
Author

p5pRT commented Sep 11, 2018

From @Grinnz

On Tue, Sep 11, 2018 at 12​:50 AM R. Diez via perl5-porters <
perl5-porters@​perl.org> wrote​:

Say you use your example code to write a subroutine. Is \Q...\E safe? What
if the user manages to inject an \E from the outside?

This is not possible. \Q and \E are interpolated directly in the string
they're declared in, they have no literal representation in a string. It's
completely safe to use them for gating input.

-Dan

@p5pRT
Copy link
Author

p5pRT commented Jan 7, 2019

From @tonycoz

On Sat, 08 Sep 2018 08​:22​:24 -0700, jkeenan wrote​:

On Sat, 08 Sep 2018 11​:00​:00 GMT, rdiezmail-perl@​yahoo.de wrote​:

Adding thousands separators to a number is a rather common operation.
Most languages use a printf format string like "%'d", but Perl does
not support it. I suggest that you add this feature and be done with
it.

Otherwise, searching the Internet will throw up a few alternatives
with different degrees of trouble. The following official Perl
documentation page has example code for that task​:

https://perldoc.perl.org/perllocale.html

Search for "Format command line params for current locale" in that
page to find the example code I am talking about.

My first impression when looking at that example was​: A "1 while"
loop
with a cryptic regex inside... what an alarming Perl nerdiness
sign...

A traditional loop without regular expressions is easier to
understand. Search for "This is for manually testing of
AddThousandsSeparators()" in the following Perl script I recently
wrote for an alternative implementation.

https://github.com/rdiez/Tools/blob/master/AnnotateWithTimestamps/AnnotateWithTimestamps.pl

I have done a quick benchmark, and my loop version is much faster, at
least with Perl v5.22.1.

One bad practice that stands out in the perllocale example is the
usage of 2 variables with the same name but with different types,
namely $grouping and @​grouping. Furthermore, I would not encourage
implicit usage of special variable $_ anymore. Using named variables
makes the code easier to grasp.

But more seriously, the perllocale example code does not work
correctly with the Spanish locale, where the thousands separator is
the period ('.') character, probably because that character has a
special meaning in regular expressions. I guess some quoting is
necessary.

OP and list​: Please review patch attached.

Thank you very much.

@​@​ -748,6 +748,11 @​@​ Here's a simple-minded example program that rewrites its command-line
parameters as integers correctly formatted in the current locale​:

  use POSIX qw(locale_h);
+ use locale qw(​:numeric);
+
+ my $old_locale;
+ $old_locale = setlocale(LC_NUMERIC);
+ setlocale(LC_NUMERIC, "es_ES.utf8");

You changed the locale, so it's no longer using what was the current locale.

+
+ my @​grouping_list;
  if ($grouping) {
- @​grouping = unpack("C*", $grouping);
+ @​grouping_list = unpack("C*", $grouping);
  } else {
- @​grouping = (3);
+ @​grouping_list = (3);
  }

As others have said, @​grouping_list isn't great, maybe rename $grouping instead, eg. $packed_grouping.

  # Format command line params for current locale
- for (@​ARGV) {
- $_ = int; # Chop non-integer part
+ for my $input (@​ARGV) {
+ my $i = int($input);

In this case I think $_ is just fine, since it's acting as a current "thing" to work on.

If it were to be renamed, I'd rather the "named" version ($input) be used inside the loop rather than the "not really named" $i.

Unrelated to the patch, the grouping code doesn't interpret the grouping list correctly, though I don't know if there's any locales anywhere that use the full power of the grouping list. From POSIX​:

The elements of grouping and mon_grouping are interpreted according to the following​:

{CHAR_MAX}
  No further grouping is to be performed.
0
  The previous element is to be repeatedly used for the remainder of the digits.
other
  The integer value is the number of digits that comprise the current group. The next element is examined to determine the size of the next group of digits before the current group.

Tony

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants