[PATCH] Docs: perlfunc: Rewrite `split' #11342

p5pRT · 2011-05-15T19:09:35Z

Migrated from rt.perl.org#90632 (status was 'resolved')

Searchable as RT90632$

p5pRT · 2011-05-15T19:09:36Z

From mfwitten@gmail.com

Subject: [PATCH] Docs: perlfunc: Rewrite `split'
Date: Sun, 10 Apr 2011 23:23:21 +0000
From: Michael Witten <mfwitten@gmail.com>

I couldn't stand the way the documentation for `split' was written;
it felt like a kludge of broken English dumped into a messy pile by
several people, each of whom was unaware of the other's work.

This variation completes sentences, adds new ones, rearranges ideas,
expands on ideas, simplifies and unifies examples, and includes more
cross references.

While the original text seemed to be written in a way that touched upon
the arguments in reverse order (which does have a hint of elegance), this
version attempts to provide the reader with the most useful information
upfront.

Signed-off-by: Michael Witten <mfwitten@gmail.com>

pod/perlfunc.pod | 200 ++++++++++++++++++++++++++++++++----------------------
1 files changed, 119 insertions(+), 81 deletions(-)

diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod
index 2a1b20a..d1ff454 100644
--- a/pod/perlfunc.pod
+++ b/pod/perlfunc.pod
@@ -5816,117 +5816,155 @@ X<split>
 
 =item split
 
+Splits the string EXPR into a list of strings and returns data
+about that list; in scalar context, the return value is the
+number of fields found, and in list context, the return value
+is the list itself.
+
+If EXPR is omitted (requiring LIMIT to be omitted as well), then
+EXPR defaults to the C<$_> string.
+
+Anything in EXPR that matches PATTERN is taken to be a delimiter
+that separates the EXPR into substrings (called "I<fields>") that do
+B<not> include the delimiter; note that a delimiter may be longer
+than one character or even have no characters at all (the empty string,
+a zero-width match). In the case of the empty string, the EXPR is
+split at the match position (between characters). As an example,
+the following:
+
+   print join(':', split('b', 'abc')), "\n";
+
+uses the 'b' as a delimiter to produce the output 'a:c'. However,
+this:
+
+   print join(':', split('', 'abc')), "\n";
+
+uses the empty string as delimiters to produce the output 'a:b:c'; thus,
+the empty string may be used to split EXPR into a list of its component
+characters; as a special case for C<split>, the empty pattern C<//>
+specifically matches the empty string, which is contrary to the normal
+use of an empty pattern to mean the last successful match.
+
+If PATTERN is C</^/>, then it is treated as if it used the
+L<multiline modifier|perlreref/operators> (C</^/m>), since it isn't much
+use otherwise.
+
+As another special PATTERN case, C<split> emulates the default behavior of
+the command line tool B<awk> when the PATTERN is a I<string> composed of a
+single space character (S<C<' '>>); in this case, any leading whitespace
+in EXPR is removed before splitting occurs, and the PATTERN is instead
+treated as if it were the L<match operator|perlop/m_> syntax C</\s+/>; in
+particular, this means that I<any> whitespace (not just a single space
+character) is used as a delimiter. However, this special treatment can be
+avoided by specifying the PATTERN using the match operator syntax (rather
+than a plain string), thereby allowing a single space character to be a
+delimiter: S<C</ />>.
+
+If PATTERN is omitted, then PATTERN defaults to C<' '> (that is, the
+aforementioned special case).
+
+The PATTERN need not be constant; an expression may be used to specify
+a pattern that varies at runtime. However, it takes time to compile a
+regular expression, so such runtime variation
+L<should be optimized|perlretut/Optimizing pattern evaluation> where
+possible by using e.g. the L<compile modifier (//o)|perlreref/operators>.
+
+If LIMIT is specified and positive, it represents the maximum number
+of fields into which the EXPR may be split; in other words, LIMIT is
+one greater than the maximum number of times EXPR may be split; thus,
+the LIMIT value C<1> means that EXPR may be split a maximum of zero
+times, producing a maximum of one field (namely, the entire value of
+EXPR); for instance:
+
+   print join(':', split(//, 'abc', 1)), "\n";
+
+produces the output 'abc', and this:
+
+   print join(':', split(//, 'abc', 2)), "\n";
+
+produces the output 'a:bc', and each of these:
+
+   print join(':', split(//, 'abc', 3)), "\n";
+   print join(':', split(//, 'abc', 4)), "\n";
+
+produces the output 'a:b:c'.
+
+If LIMIT is negative, it is treated as if it were instead arbitrarily
+large; as many fields as possible are produced.
+
+If LIMIT is omitted (or, equivalently, zero), then it is usually
+treated as if it were instead negative but with the exception that
+trailing empty fields are stripped (empty leading fields are always
+preserved); if all fields are empty, then all fields are considered to
+be trailing (and are thus stripped in this case). Thus, the following:
+
+   print join(':', split(',', 'a,b,c,,,')), "\n";
+
+produces the output 'a:b:c', but the following:
+
+   print join(':', split(',', 'a,b,c,,,', -1)), "\n";
+
+produces the output 'a:b:c:::'.
+
+In time-critical applications, it is worthwhile to avoid splitting
+into more fields than necessary.  Thus, when assigning to a list,
+if LIMIT is omitted (or zero), then C<split> is implicitly given
+a LIMIT that is one larger than the number of variables in the
+list; for the following, LIMIT is implicitly 4:
+
+    ($login, $passwd, $remainder) = split(/:/);
+
+Note that splitting an EXPR that evaluates to the empty string always
+produces zero fields, regardless of the LIMIT specified.
+
+Empty leading fields are produced when there are positive-width matches at
+the beginning of the string; for instance:
+
+   print join(':', split(/ /, ' abc')), "\n";
+
+produces the output ':abc'.  However, a zero-width match at the
+beginning of the string never produces an empty field; for example:
+
+   print join(':', split(//, 'abc'));
+
+produces the output 'a:b:c' (rather than ':a:b:c').
+
+Empty trailing fields, on the other hand, are produced when there is a
+match at the end of the string, regardless of the length of the match
+(of course, unless a non-zero LIMIT is given explicitly, such fields are
+removed, as in the last example); the following:
+
+   print join(':', split(//, 'abc', -1)), "\n";
+
+produces the output 'a:b:c:'.
+
+If the PATTERN contains
+L<regular expression groups|perlretut/Grouping things and hierarchical matching>,
+then for each delimiter, additional fields are produced from the substrings
+captured by each group (in the order in which the groups are specified,
+as per L<backreferences|perlretut/Backreferences>); if any group does not
+match, then it captures C<undef> instead of a substring. Also, note that
+such additional fields are produced whenever there is a delimiter (that
+is, whenever a split occurs), and such additional fields do B<not> count
+towards the LIMIT (in some sense, then, it is better to think of LIMIT as
+one greater than the maximum number of splits that may occur). Consider the
+following expressions evaluated in list context (the returned lists are
+provided in the associated comments):
+
+   split(/-|,/, "1-10,20", 3)
+   # ('1', '10', '20')
+
+   split(/(-|,)/, "1-10,20", 3)
+   # ('1', '-', '10', ',', '20')
+
+   split(/-|(,)/, "1-10,20", 3)
+   # ('1', undef, '10', ',', '20')
+
+   split(/(-)|,/, "1-10,20", 3)
+   # ('1', '-', '10', 'undef', '20')
+
+   split(/(-)|(,)/, "1-10,20", 3)
+   # ('1', '-', undef, '10', undef, ',', '20')
-Splits the string EXPR into a list of strings and returns that list.  By
-default, empty leading fields are preserved, and empty trailing ones are
-deleted.  (If all fields are empty, they are considered to be trailing.)
-
-In scalar context, returns the number of fields found.
-
-If EXPR is omitted, splits the C<$_> string.  If PATTERN is also omitted,
-splits on whitespace (after skipping any leading whitespace).  Anything
-matching PATTERN is taken to be a delimiter separating the fields.  (Note
-that the delimiter may be longer than one character.)
-
-If LIMIT is specified and positive, it represents the maximum number
-of fields the EXPR will be split into, though the actual number of
-fields returned depends on the number of times PATTERN matches within
-EXPR.  If LIMIT is unspecified or zero, trailing null fields are
-stripped (which potential users of C<pop> would do well to remember).
-If LIMIT is negative, it is treated as if an arbitrarily large LIMIT
-had been specified.  Note that splitting an EXPR that evaluates to the
-empty string always returns the empty list, regardless of the LIMIT
-specified.
-
-A pattern matching the empty string (not to be confused with
-an empty pattern C<//>, which is just one member of the set of patterns
-matching the epmty string), splits EXPR into individual
-characters.  For example:
-
-    print join(':', split(/ */, 'hi there')), "\n";
-
-produces the output 'h:i:t:h:e:r:e'.
-
-As a special case for C<split>, the empty pattern C<//> specifically
-matches the empty string; this is not be confused with the normal use
-of an empty pattern to mean the last successful match.  So to split
-a string into individual characters, the following:
-
-    print join(':', split(//, 'hi there')), "\n";
-
-produces the output 'h:i: :t:h:e:r:e'.
-
-Empty leading fields are produced when there are positive-width matches at
-the beginning of the string; a zero-width match at the beginning of
-the string does not produce an empty field. For example:
-
-   print join(':', split(/(?=\w)/, 'hi there!'));
-
-produces the output 'h:i :t:h:e:r:e!'. Empty trailing fields, on the other
-hand, are produced when there is a match at the end of the string (and
-when LIMIT is given and is not 0), regardless of the length of the match.
-For example:
-
-   print join(':', split(//,   'hi there!', -1)), "\n";
-   print join(':', split(/\W/, 'hi there!', -1)), "\n";
-
-produce the output 'h:i: :t:h:e:r:e:!:' and 'hi:there:', respectively,
-both with an empty trailing field.
-
-The LIMIT parameter can be used to split a line partially
-
-    ($login, $passwd, $remainder) = split(/:/, $_, 3);
-
-When assigning to a list, if LIMIT is omitted, or zero, Perl supplies
-a LIMIT one larger than the number of variables in the list, to avoid
-unnecessary work.  For the list above LIMIT would have been 4 by
-default.  In time critical applications it behooves you not to split
-into more fields than you really need.
-
-If the PATTERN contains parentheses, additional list elements are
-created from each matching substring in the delimiter.
-
-    split(/([,-])/, "1-10,20", 3);
-
-produces the list value
-
-    (1, '-', 10, ',', 20)
-
-If you had the entire header of a normal Unix email message in $header,
-you could split it up into fields and their values this way:
-
-    $header =~ s/\n(?=\s)//g;  # fix continuation lines
-    %hdrs   =  (UNIX_FROM => split /^(\S*?):\s*/m, $header);
-
-The pattern C</PATTERN/> may be replaced with an expression to specify
-patterns that vary at runtime.  (To do runtime compilation only once,
-use C</$variable/o>.)
-
-As a special case, specifying a PATTERN of space (S<C<' '>>) will split on
-white space just as C<split> with no arguments does.  Thus, S<C<split(' ')>> can
-be used to emulate B<awk>'s default behavior, whereas S<C<split(/ /)>>
-will give you as many initial null fields (empty string) as there are leading spaces.
-A C<split> on C</\s+/> is like a S<C<split(' ')>> except that any leading
-whitespace produces a null first field.  A C<split> with no arguments
-really does a S<C<split(' ', $_)>> internally.
-
-A PATTERN of C</^/> is treated as if it were C</^/m>, since it isn't
-much use otherwise.
-
-Example:
-
-    open(PASSWD, '/etc/passwd');
-    while (<PASSWD>) {
-        chomp;
-        ($login, $passwd, $uid, $gid,
-         $gcos, $home, $shell) = split(/:/);
-        #...
-    }
-
-As with regular pattern matching, any capturing parentheses that are not
-matched in a C<split()> will be set to C<undef> when returned:
-
-    @fields = split /(A)|B/, "1A2B3";
-    # @fields is (1, 'A', 2, undef, 3)
 
 =item sprintf FORMAT, LIST
 X<sprintf>
--

1.7.4.18.g68fe8

Flags:
category=docs
severity=low
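
As a quick sanity check, the outputs claimed by a few of the examples in the patch can be reproduced with a short script. This is only a minimal sketch: the variable names ($default, $negative, $limited, $leading) are made up for illustration and are not part of the patch.

```perl
use strict;
use warnings;

# Omitted LIMIT: trailing empty fields are stripped.
my $default = join(':', split(',', 'a,b,c,,,'));

# Negative LIMIT: trailing empty fields are preserved.
my $negative = join(':', split(',', 'a,b,c,,,', -1));

# Positive LIMIT of 2: at most one split, hence at most two fields.
my $limited = join(':', split(//, 'abc', 2));

# A positive-width match at the start produces an empty leading field.
my $leading = join(':', split(/ /, ' abc'));

print "$default\n";   # a:b:c
print "$negative\n";  # a:b:c:::
print "$limited\n";   # a:bc
print "$leading\n";   # :abc
```

Each printed line should match the output quoted for the corresponding example in the patch text.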

p5pRT · 2011-05-16T13:40:18Z

From tchrist@perl.com

I couldn't stand the way the documentation for `split' was written;
it felt like a kludge of broken English dumped into a messy pile by
several people, each of whom was unaware of the other's work.

I think you'll find that description fits quite a bit of the documentation.

It's a real problem.

--tom

p5pRT · 2011-05-16T13:40:19Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2011-05-16T14:14:39Z

From bmb@Mail.Libs.UGA.EDU

On Sun, May 15, 2011 at 3:09 PM, Michael Witten
<perlbug-followup@perl.org> wrote:

# New Ticket Created by Michael Witten
# Please include the string: [perl #90632]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=90632 >

Subject: [PATCH] Docs: perlfunc: Rewrite `split'
Date: Sun, 10 Apr 2011 23:23:21 +0000
From: Michael Witten <mfwitten@gmail.com>

I couldn't stand the way the documentation for `split' was written;
it felt like a kludge of broken English dumped into a messy pile by
several people, each of whom was unaware of the other's work.

On a quick read, I'd like to make one suggestion. You have
a lot of clauses combined with a semicolon. I think that in
every case, a new sentence would be better.

A period lets me take a mental breath. With a semicolon, I
have to keep holding that breath, because you're telling me
that the sentence isn't finished yet. In my opinion, those long
sentences would work fine split into two. :-)

--
Brad

p5pRT · 2011-05-16T17:11:47Z

From tsibley@cpan.org

There's a quoted undef in the second to last split example.
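
For reference, the difference matters because a capture group that does not participate in a match yields the undefined value, not the string 'undef'. A minimal sketch that makes this visible (the @fields name and the printing helper are illustrative only):

```perl
use strict;
use warnings;

# One capturing group; when the ',' alternative matches, the group
# is left unmatched and contributes undef to the returned list.
my @fields = split(/(-)|,/, "1-10,20", 3);

# Show each field, marking undefined ones explicitly.
print join(', ', map { defined $_ ? "'$_'" : 'undef' } @fields), "\n";
# prints: '1', '-', '10', undef, '20'
```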

p5pRT · 2011-05-17T20:06:39Z

From mfwitten@gmail.com

Here is an updated patch.

Save this message to /tmp/p and then apply it as follows:

$ cd /path/to/perl/repo
$ git am --scissors /tmp/p

8<-----------8<-----------8<-----------8<-----------8<-----------8<-----------

Date: Sun, 10 Apr 2011 23:23:21 +0000

I couldn't stand the way the documentation for `split' was written;
it felt like a kludge of broken English dumped into a messy pile by
several people, each of whom was unaware of the other's work.

This variation completes sentences, adds new ones, rearranges ideas,
expands on ideas, simplifies and unifies examples, and includes more
cross references.

While the original text seemed to be written in a way that touched upon
the arguments in reverse order (which did have a hint of elegance), this
version attempts to provide the reader with the most useful information
upfront.

Thanks to Brad Baxter and Thomas R. Sibley for their constructive
criticism.

Signed-off-by: Michael Witten <mfwitten@gmail.com>

pod/perlfunc.pod | 206 +++++++++++++++++++++++++++++++++---------------------
1 files changed, 125 insertions(+), 81 deletions(-)

Inline Patch

diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod
index 2a1b20a..11432d8 100644
--- a/pod/perlfunc.pod
+++ b/pod/perlfunc.pod
@@ -5816,117 +5816,161 @@ X<split>
 
 =item split
 
+Splits the string EXPR into a list of strings and returns data
+about that list: In scalar context, the return value is the
+number of fields found, and in list context, the return value
+is the list itself.
+
+If EXPR is omitted (requiring LIMIT to be omitted as well), then
+EXPR defaults to the C<$_> string.
+
+Anything in EXPR that matches PATTERN is taken to be a delimiter
+that separates the EXPR into substrings (called "I<fields>") that
+do B<not> include the delimiter. Note that a delimiter may be
+longer than one character or even have no characters at all (the
+empty string, which is a zero-width match).
+
+The PATTERN need not be constant; an expression may be used
+to specify a pattern that varies at runtime. However, it
+takes time to compile a regular expression, so such runtime
+variation L<should be optimized|perlretut/Optimizing pattern
+evaluation> where possible by using e.g. the L<compile modifier
+(//o)|perlreref/operators>.
+
+If PATTERN matches the empty string, the EXPR is split at the match
+position (between characters). As an example, the following:
+
+   print join(':', split('b', 'abc')), "\n";
+
+uses the 'b' in 'abc' as a delimiter to produce the output 'a:c'.
+However, this:
+
+   print join(':', split('', 'abc')), "\n";
+
+uses empty string matches as delimiters to produce the output
+'a:b:c'; thus, the empty string may be used to split EXPR into a
+list of its component characters.
+
+As a special case for C<split>, the empty pattern given in
+L<match operator|perlop/m_> syntax (C<//>) specifically matches
+the empty string, which is contrary to its usual interpretation
+as the last successful match.
+
+If PATTERN is C</^/>, then it is treated as if it used the
+L<multiline modifier|perlreref/operators> (C</^/m>), since it
+isn't much use otherwise.
+
+As another special case, C<split> emulates the default behavior of
+the command line tool B<awk> when the PATTERN is a I<string> composed
+of a single space character (S<C<' '>>): Any leading whitespace in
+EXPR is removed before splitting occurs, and the PATTERN is instead
+treated as if it were C</\s+/>; in particular, this means that I<any>
+whitespace (not just a single space character) is used as a delimiter.
+However, this special treatment can be avoided by specifying the
+PATTERN using the match operator syntax (rather than a plain string),
+thereby allowing a single space character to be a delimiter: S<C</
+/>>.
+
+If PATTERN is omitted, then PATTERN defaults to a string composed
+of a single space character (S<C<' '>>), which invokes the
+aforementioned B<awk> emulation.
+
+If LIMIT is specified and positive, it represents the maximum number
+of fields into which the EXPR may be split; in other words, LIMIT is
+one greater than the maximum number of times EXPR may be split. Thus,
+the LIMIT value C<1> means that EXPR may be split a maximum of zero
+times, producing a maximum of one field (namely, the entire value of
+EXPR). For instance:
+
+   print join(':', split(//, 'abc', 1)), "\n";
+
+produces the output 'abc', and this:
+
+   print join(':', split(//, 'abc', 2)), "\n";
+
+produces the output 'a:bc', and each of these:
+
+   print join(':', split(//, 'abc', 3)), "\n";
+   print join(':', split(//, 'abc', 4)), "\n";
+
+produces the output 'a:b:c'.
+
+If LIMIT is negative, it is treated as if it were instead arbitrarily
+large; as many fields as possible are produced.
+
+If LIMIT is omitted (or, equivalently, zero), then it is usually
+treated as if it were instead negative but with the exception that
+trailing empty fields are stripped (empty leading fields are always
+preserved); if all fields are empty, then all fields are considered to
+be trailing (and are thus stripped in this case). Thus, the following:
+
+   print join(':', split(',', 'a,b,c,,,')), "\n";
+
+produces the output 'a:b:c', but the following:
+
+   print join(':', split(',', 'a,b,c,,,', -1)), "\n";
+
+produces the output 'a:b:c:::'.
+
+In time-critical applications, it is worthwhile to avoid splitting
+into more fields than necessary. Thus, when assigning to a list,
+if LIMIT is omitted (or zero), then LIMIT is treated as though it
+were one larger than the number of variables in the list; for the
+following, LIMIT is implicitly 4:
+
+    ($login, $passwd, $remainder) = split(/:/);
+
+Note that splitting an EXPR that evaluates to the empty string always
+produces zero fields, regardless of the LIMIT specified.
+
+An empty leading field is produced when there is a positive-width
+match at the beginning of EXPR. For instance:
+
+   print join(':', split(/ /, ' abc')), "\n";
+
+produces the output ':abc'. However, a zero-width match at the
+beginning of EXPR never produces an empty field, so that:
+
+   print join(':', split(//, ' abc'));
+
+produces the output S<' :a:b:c'> (rather than S<': :a:b:c'>).
+
+An empty trailing field, on the other hand, is produced when there is a
+match at the end of EXPR, regardless of the length of the match
+(of course, unless a non-zero LIMIT is given explicitly, such fields are
+removed, as in the last example). Thus:
+
+   print join(':', split(//, ' abc', -1)), "\n";
+
+produces the output S<' :a:b:c:'>.
+
+If the PATTERN contains
+L<regular expression groups|perlretut/Grouping things and hierarchical matching>,
+then for each delimiter, an additional field is produced for each substring
+captured by a group (in the order in which the groups are specified,
+as per L<backreferences|perlretut/Backreferences>); if any group does not
+match, then it captures the C<undef> value instead of a substring. Also, note
+that any such additional field is produced whenever there is a delimiter (that
+is, whenever a split occurs), and such an additional field does B<not> count
+towards the LIMIT (in some sense, then, it is better to think of LIMIT as
+one greater than the maximum number of splits that may occur). Consider the
+following expressions evaluated in list context (each returned list is provided
+in the associated comment):
+
+   split(/-|,/, "1-10,20", 3)
+   # ('1', '10', '20')
+
+   split(/(-|,)/, "1-10,20", 3)
+   # ('1', '-', '10', ',', '20')
+
+   split(/-|(,)/, "1-10,20", 3)
+   # ('1', undef, '10', ',', '20')
+
+   split(/(-)|,/, "1-10,20", 3)
+   # ('1', '-', '10', undef, '20')
+
+   split(/(-)|(,)/, "1-10,20", 3)
+   # ('1', '-', undef, '10', undef, ',', '20')
-Splits the string EXPR into a list of strings and returns that list.  By
-default, empty leading fields are preserved, and empty trailing ones are
-deleted.  (If all fields are empty, they are considered to be trailing.)
-
-In scalar context, returns the number of fields found.
-
-If EXPR is omitted, splits the C<$_> string.  If PATTERN is also omitted,
-splits on whitespace (after skipping any leading whitespace).  Anything
-matching PATTERN is taken to be a delimiter separating the fields.  (Note
-that the delimiter may be longer than one character.)
-
-If LIMIT is specified and positive, it represents the maximum number
-of fields the EXPR will be split into, though the actual number of
-fields returned depends on the number of times PATTERN matches within
-EXPR.  If LIMIT is unspecified or zero, trailing null fields are
-stripped (which potential users of C<pop> would do well to remember).
-If LIMIT is negative, it is treated as if an arbitrarily large LIMIT
-had been specified.  Note that splitting an EXPR that evaluates to the
-empty string always returns the empty list, regardless of the LIMIT
-specified.
-
-A pattern matching the empty string (not to be confused with
-an empty pattern C<//>, which is just one member of the set of patterns
-matching the epmty string), splits EXPR into individual
-characters.  For example:
-
-    print join(':', split(/ */, 'hi there')), "\n";
-
-produces the output 'h:i:t:h:e:r:e'.
-
-As a special case for C<split>, the empty pattern C<//> specifically
-matches the empty string; this is not be confused with the normal use
-of an empty pattern to mean the last successful match.  So to split
-a string into individual characters, the following:
-
-    print join(':', split(//, 'hi there')), "\n";
-
-produces the output 'h:i: :t:h:e:r:e'.
-
-Empty leading fields are produced when there are positive-width matches at
-the beginning of the string; a zero-width match at the beginning of
-the string does not produce an empty field. For example:
-
-   print join(':', split(/(?=\w)/, 'hi there!'));
-
-produces the output 'h:i :t:h:e:r:e!'. Empty trailing fields, on the other
-hand, are produced when there is a match at the end of the string (and
-when LIMIT is given and is not 0), regardless of the length of the match.
-For example:
-
-   print join(':', split(//,   'hi there!', -1)), "\n";
-   print join(':', split(/\W/, 'hi there!', -1)), "\n";
-
-produce the output 'h:i: :t:h:e:r:e:!:' and 'hi:there:', respectively,
-both with an empty trailing field.
-
-The LIMIT parameter can be used to split a line partially
-
-    ($login, $passwd, $remainder) = split(/:/, $_, 3);
-
-When assigning to a list, if LIMIT is omitted, or zero, Perl supplies
-a LIMIT one larger than the number of variables in the list, to avoid
-unnecessary work.  For the list above LIMIT would have been 4 by
-default.  In time critical applications it behooves you not to split
-into more fields than you really need.
-
-If the PATTERN contains parentheses, additional list elements are
-created from each matching substring in the delimiter.
-
-    split(/([,-])/, "1-10,20", 3);
-
-produces the list value
-
-    (1, '-', 10, ',', 20)
-
-If you had the entire header of a normal Unix email message in $header,
-you could split it up into fields and their values this way:
-
-    $header =~ s/\n(?=\s)//g;  # fix continuation lines
-    %hdrs   =  (UNIX_FROM => split /^(\S*?):\s*/m, $header);
-
-The pattern C</PATTERN/> may be replaced with an expression to specify
-patterns that vary at runtime.  (To do runtime compilation only once,
-use C</$variable/o>.)
-
-As a special case, specifying a PATTERN of space (S<C<' '>>) will split on
-white space just as C<split> with no arguments does.  Thus, S<C<split(' ')>> can
-be used to emulate B<awk>'s default behavior, whereas S<C<split(/ /)>>
-will give you as many initial null fields (empty string) as there are leading spaces.
-A C<split> on C</\s+/> is like a S<C<split(' ')>> except that any leading
-whitespace produces a null first field.  A C<split> with no arguments
-really does a S<C<split(' ', $_)>> internally.
-
-A PATTERN of C</^/> is treated as if it were C</^/m>, since it isn't
-much use otherwise.
-
-Example:
-
-    open(PASSWD, '/etc/passwd');
-    while (<PASSWD>) {
-        chomp;
-        ($login, $passwd, $uid, $gid,
-         $gcos, $home, $shell) = split(/:/);
-        #...
-    }
-
-As with regular pattern matching, any capturing parentheses that are not
-matched in a C<split()> will be set to C<undef> when returned:
-
-    @fields = split /(A)|B/, "1A2B3";
-    # @fields is (1, 'A', 2, undef, 3)
 
 =item sprintf FORMAT, LIST
 X<sprintf>
-- 
1.7.4.18.g68fe8

p5pRT · 2011-05-18T00:50:57Z

From bmb@Mail.Libs.UGA.EDU

I won't belabor this point beyond this post, but I did want
to give examples of what I meant about (IMO) a bit of overuse
of semicolon.

Note that every change I'm suggesting is simply to replace
the semicolon with a period, with obvious punctuation changes.

Cheers ...

I think the following sentence is fine.

The PATTERN need not be constant; an expression may be used
to specify a pattern that varies at runtime.

But I'd change this:

However, this:

print join(':', split('', 'abc')), "\n";

uses empty string matches as delimiters to produce the output
'a:b:c'; thus, the empty string may be used to split EXPR into a
list of its component characters.

to this:

However, this:

print join(':', split('', 'abc')), "\n";

uses empty string matches as delimiters to produce the output
'a:b:c'. Thus, the empty string may be used to split EXPR into a
list of its component characters.

I would change this:

As another special case, C<split> emulates the default behavior of
the command line tool B<awk> when the PATTERN is a I<string> composed
of a single space character (S<C<' '>>): Any leading whitespace in
EXPR is removed before splitting occurs, and the PATTERN is instead
treated as if it were C</\s+/>; in particular, this means that I<any>
whitespace (not just a single space character) is used as a delimiter.

to this:

As another special case, C<split> emulates the default behavior of
the command line tool B<awk> when the PATTERN is a I<string> composed
of a single space character (S<C<' '>>): Any leading whitespace in
EXPR is removed before splitting occurs, and the PATTERN is instead
treated as if it were C</\s+/>. In particular, this means that I<any>
whitespace (not just a single space character) is used as a delimiter.

And change this:

If LIMIT is specified and positive, it represents the maximum number
of fields into which the EXPR may be split; in other words, LIMIT is
one greater than the maximum number of times EXPR may be split.

to this:

If LIMIT is specified and positive, it represents the maximum number
of fields into which the EXPR may be split. In other words, LIMIT is
one greater than the maximum number of times EXPR may be split.

I think this is fine:

If LIMIT is negative, it is treated as if it were instead arbitrarily
large; as many fields as possible are produced.

But I'd change this:

If LIMIT is omitted (or, equivalently, zero), then it is usually
treated as if it were instead negative but with the exception that
trailing empty fields are stripped (empty leading fields are always
preserved); if all fields are empty, then all fields are considered to
be trailing (and are thus stripped in this case).

to this:

If LIMIT is omitted (or, equivalently, zero), then it is usually
treated as if it were instead negative but with the exception that
trailing empty fields are stripped (empty leading fields are always
preserved). If all fields are empty, then all fields are considered to
be trailing (and are thus stripped in this case).

This:

In time-critical applications, it is worthwhile to avoid splitting
into more fields than necessary. Thus, when assigning to a list,
if LIMIT is omitted (or zero), then LIMIT is treated as though it
were one larger than the number of variables in the list; for the
following, LIMIT is implicitly 4:

($login, $passwd, $remainder) = split(/:/);

to this:

In time-critical applications, it is worthwhile to avoid splitting
into more fields than necessary. Thus, when assigning to a list,
if LIMIT is omitted (or zero), then LIMIT is treated as though it
were one larger than the number of variables in the list. For the
following, LIMIT is implicitly 4:

($login, $passwd, $remainder) = split(/:/);

And this:

If the PATTERN contains
L<regular expression groups|perlretut/Grouping things and hierarchical matching>,
then for each delimiter, an additional field is produced for each substring
captured by a group (in the order in which the groups are specified,
as per L<backreferences|perlretut/Backreferences>); if any group does not
match, then it captures the C<undef> value instead of a substring.

to this:

If the PATTERN contains
L<regular expression groups|perlretut/Grouping things and hierarchical matching>,
then for each delimiter, an additional field is produced for each substring
captured by a group (in the order in which the groups are specified,
as per L<backreferences|perlretut/Backreferences>). If any group does not
match, then it captures the C<undef> value instead of a substring.

--
Brad

p5pRT · 2011-05-18T02:05:52Z

From bmb@Mail.Libs.UGA.EDU

On Tue, May 17, 2011 at 3:56 PM, Michael Witten <mfwitten@gmail.com> wrote:

+Splits the string EXPR into a list of strings and returns data
+about that list: In scalar context, the return value is the
+number of fields found, and in list context, the return value
+is the list itself.
+
+If EXPR is omitted (requiring LIMIT to be omitted as well), then
+EXPR defaults to the C<$_> string.

FWIW, I'm uneasy about this patch. After getting over the semicolons,
I started trying to get a handle on the content. I got as far as thinking
that the above could read better this way:

Splits the string EXPR into a list of substrings (or I<fields>).
In list context, it returns the list. In scalar context, it
returns the number of fields.

If EXPR and LIMIT are omitted, EXPR defaults to the C<$_> string.

But then I started not being able to really say whether the patch
explains things better than the original. It's more verbose, including
more details, but -- maybe because I'm so used to it -- the original
still seems to get the points across in a way that for me is a bit
easier to follow.

I wonder if a split tutorial might be a better way to include the extra
details. So I think I'll bow out of the discussion for now, because
I'm not sure I can really add any more.

Regards,

Brad

p5pRT · 2011-05-18T16:02:26Z

From mfwitten@gmail.com

On Wed, May 18, 2011 at 00:50, Brad Baxter <bmb@mail.libs.uga.edu> wrote:

I won't belabor this point beyond this post, but I did want
to give examples of what I meant about (IMO) a bit of overuse
of semicolon.

Note that every change I'm suggesting is simply to replace
the semicolon with a period, with obvious punctuation changes.

Cheers ...

Do not think that I don't appreciate the fact that you've taken the
time to read through my patches, and I do not feel that you are
belaboring the point.

I do agree that I was a little too liberal with the semicolons on the
first patch, and I removed some of them accordingly; there are 9 fewer
semicolons in the second patch. However, I don't think that I need to
make any further reductions.

To me, the semicolon provides a means for connecting 2 highly related
statements without having to give up a simple sentence structure. It
just so happens that within a technical description like the one
provided in these patches, many of the sentences are strongly coupled
in this way.

p5pRT · 2011-05-18T16:35:19Z

From mfwitten@gmail.com

On Tue, 17 May 2011 22:05:21 -0400, Brad Baxter wrote:

On Tue, May 17, 2011 at 3:56 PM, Michael Witten <mfwitten@gmail.com> wrote:

+Splits the string EXPR into a list of strings and returns data
+about that list: In scalar context, the return value is the
+number of fields found, and in list context, the return value
+is the list itself.
+
+If EXPR is omitted (requiring LIMIT to be omitted as well), then
+EXPR defaults to the C<$_> string.

FWIW, I'm uneasy about this patch. After getting over the semicolons,
I started trying to get a handle on the content. I got as far as thinking
that the above could read better this way:

---
Splits the string EXPR into a list of substrings (or I<fields>).
In list context, it returns the list. In scalar context, it
returns the number of fields.

If EXPR and LIMIT are omitted, EXPR defaults to the C<$_> string.
---

I think the problem here is that I use `strings' and then `fields'. First,
consider that my text introduces the term `fields' more formally shortly
thereafter:

Anything in EXPR that matches PATTERN is taken to be a delimiter
that separates the EXPR into substrings (called "I<fields>") that
do B<not> include the delimiter. Note that a delimiter may be
longer than one character or even have no characters at all (the
empty string, which is a zero-width match).

Therefore, I think it is best to view the first paragraph as a quick
overview of the description that follows. Thus, the word `fields' should
be used exclusively (also, the return value should be described in 2
separate sentences, as you suggest):

Splits the string EXPR into a list of fields and returns data
about that list: In scalar context, the return value is the
number of fields found. In list context, the return value
is the list itself.

After getting over the semicolons, I started trying to get a handle on
the content. I got as far as [the first few sentences, which I don't
like]... So I think I'll bow out of the discussion for now, because
I'm not sure I can really add any more.

Oh, come on, now; don't be so melodramatic.

I get the feeling that you are a lot like I am: The second things feel
gross (especially early on), you want to wash your hands of them.

However, that is the very purpose of discussion: To figure out what is
gross and to clean it. So, please don't bow out just yet; how can my
patch be improved without people telling me how awful it is?

I started not being able to really say whether the patch explains
things better than the original. It's more verbose, including more
details, but -- maybe because I'm so used to it -- the original still
seems to get the points across in a way that for me is a bit easier
to follow.

I don't think you are in a position to make that judgment until you have
read my entire patch with the same exacting eye.

I took another look at the existing text and my resolve is stronger than
ever. Any grossness in my patch is far outdone by that of the existing text.

Sincerely,
Michael Witten

p5pRT · 2011-05-22T20:34:36Z

From @cpansprout

On Tue May 17 13:06:39 2011, mfwitten wrote:

Here is an updated patch.

I have some criticism to offer, which I hope you will find more
constructive than obstructive.

+The PATTERN need not be constant; an expression may be used
+to specify a pattern that varies at runtime. However, it
+takes time to compile a regular expression, so such runtime
+variation L<should be optimized|perlretut/Optimizing pattern
+evaluation> where possible by using e.g. the L<compile modifier
+(//o)|perlreref/operators>.

There is no need to mention /o. In fact, I find it confusing. /o does
not provide much of a speed increase any more (ignoring tied variables
and overloaded objects), because regexp compilation is skipped if the
string has not changed. This paragraph also disregards qr//.

+
+If PATTERN matches the empty string, the EXPR is split at the match
+position (between characters). As an example, the following:
+
+ print join(':', split('b', 'abc')), "\n";
+
+uses the 'b' in 'abc' as a delimiter to produce the output 'a:c'.
+However, this:
+
+ print join(':', split('', 'abc')), "\n";
+
+uses empty string matches as delimiters to produce the output
+'a:b:c'; thus, the empty string may be used to split EXPR into a
+list of its component characters.
+
+As a special case for C<split>, the empty pattern given in
+L<match operator|perlop/m_> syntax (C<//>) specifically matches

I don’t think that’s a valid pod link. It should be
L<...|perlop/"m/PATTERN/msixpodualgc"> or something like that.

+the empty string, which is contrary to its usual interpretation
+as the last successful match.
+
+If PATTERN is C</^/>, then it is treated as if it used the
+L<multiline modifier|perlreref/operators> (C</^/m>), since it
+isn't much use otherwise.
+
+As another special case, C<split> emulates the default behavior of
+the command line tool B<awk> when the PATTERN is a I<string> composed
+of a single space character (S<C<' '>>): Any leading whitespace in

The split " " behaviour only applies to a literal string, not to a
variable. (But, due to optimisation, it *does* apply to split(" "."",
$stuff).)

+EXPR is removed before splitting occurs, and the PATTERN is instead
+treated as if it were C</\s+/>; in particular, this means that I<any>
+whitespace (not just a single space character) is used as a delimiter.
+However, this special treatment can be avoided by specifying the
+PATTERN using the match operator syntax (rather than a plain string),
+thereby allowing a single space character to be a delimiter: S<C</
+/>>.

s/thereby allowing/thereby only allowing/

+If the PATTERN contains
+L<regular expression groups|perlretut/Grouping things and hierarchical
+matching>,

Please say ‘capturing groups’. Just ‘groups’ would include (?:...).

+then for each delimiter, an additional field is produced for each substring
+captured by a group (in the order in which the groups are specified,
+as per L<backreferences|perlretut/Backreferences>); if any group does not
+match, then it captures the C<undef> value instead of a substring. Also, note
+that any such additional field is produced whenever there is a delimiter (that
+is, whenever a split occurs), and such an additional field does B<not> count
+towards the LIMIT (in some sense, then, it is better to think of LIMIT as
+one greater than the maximum number of splits that may occur).

I think that parenthetical remark can be removed, as it repeats
something stated earlier.

Also, do you want to document this? :-)

()=@==split" ","Just another Perl hacker,\n";
print reverse@=

p5pRT · 2011-05-23T06:23:20Z

From @ap

* Brad Baxter <bmb@mail.libs.uga.edu> [2011-05-18 04:10]:

On Tue, May 17, 2011 at 3:56 PM, Michael Witten <mfwitten@gmail.com> wrote:

+Splits the string EXPR into a list of strings and returns data
+about that list: In scalar context, the return value is the
+number of fields found, and in list context, the return value
+is the list itself.
+
+If EXPR is omitted (requiring LIMIT to be omitted as well), then
+EXPR defaults to the C<$_> string.

FWIW, I'm uneasy about this patch. After getting over the
semicolons, I started trying to get a handle on the content.

“Returns data about the list” for something usually used to get
that list itself sounds almost comically lawyeristic.

I got as far as thinking that the above could read better this
way:

---
Splits the string EXPR into a list of substrings (or I<fields>).
In list context, it returns the list. In scalar context, it
returns the number of fields.

If EXPR and LIMIT are omitted, EXPR defaults to the C<$_> string.
---

Agreed. In fact, I suggest cutting away even more:

Splits the string EXPR into a list of strings and returns the
list in list context, or the size of the list in scalar context.

If only PATTERN is given, EXPR defaults to C<$_>.
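
A short sketch of that default, with illustrative records aliased to C<$_> by a loop:

```perl
use strict;
use warnings;

my @seen;
for ('k1=v1', 'k2=v2') {             # illustrative records, aliased to $_
    my ($key, $value) = split /=/;   # same as split /=/, $_
    push @seen, "$key->$value";
}
print "@seen\n";   # k1->v1 k2->v2
```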

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

p5pRT · 2011-05-23T12:34:19Z

From bmb@Mail.Libs.UGA.EDU

On Wed, May 18, 2011 at 12:01 PM, Michael Witten <mfwitten@gmail.com> wrote:

To me, the semicolon provides a means for connecting 2 highly related
statements without having to give up a simple sentence structure. It
just so happens that within a technical description like the one
provided in these patches, many of the sentences are strongly coupled
in this way.

And I think that the fact those sentences are in the same paragraph
is enough to communicate that they are related. But we've moved into
the realm of style, and I'm happy to concede that not everyone agrees
with mine. :-)

--
Brad

p5pRT · 2011-05-23T12:39:40Z

From bmb@Mail.Libs.UGA.EDU

On Wed, May 18, 2011 at 12:34 PM, Michael Witten <mfwitten@gmail.com> wrote:

On Tue, 17 May 2011 22:05:21 -0400, Brad Baxter wrote:

After getting over the semicolons, I started trying to get a handle on
the content. I got as far as [the first few sentences, which I don't
like]... So I think I'll bow out of the discussion for now, because
I'm not sure I can really add any more.

Oh, come on, now; don't be so melodramatic.

Not my intention. :-) More accurately, I'm not sure I have the spare
time at the moment to study the whole picture and add something
truly brilliant. So I'll be lazy and let others chime in who perhaps do.

--
Brad

p5pRT · 2011-05-24T19:05:39Z

From mfwitten@gmail.com

On Sun, 22 May 2011 13:34:36 -0700, Father Chrysostomos wrote:

+The PATTERN need not be constant; an expression may be used
+to specify a pattern that varies at runtime. However, it
+takes time to compile a regular expression, so such runtime
+variation L<should be optimized|perlretut/Optimizing pattern
+evaluation> where possible by using e.g. the L<compile modifier
+(//o)|perlreref/operators>.

There is no need to mention /o. In fact, I find it confusing. /o does
not provide much of a speed increase any more (ignoring tied variables
and overloaded objects), because regexp compilation is skipped if the
string has not changed. This paragraph also disregards qr//.

I figured as much, but I felt that it would be best to be told to
remove `/o' rather than to do so on my own initiative. Thanks.

As for `qr//', I actually did mention it in a previous, unpublished
version, but my testing indicates that `qr//' did not provide the
expected improvements in efficiency, so I just left it out.

However, removing the discussion of `/o' (and leaving out qr//)
would cause any mention of efficiency to become unhelpful.
Should that paragraph simply be reduced to the following?

The PATTERN need not be constant; an expression may be used to
specify a pattern that varies at runtime.
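
A minimal sketch of what a run-time PATTERN could look like; the variable names and inputs are illustrative, and nothing here is a claim about relative speed:

```perl
use strict;
use warnings;

# Separator chosen at run time; \Q...\E keeps '|' from being read
# as regex alternation.
my $sep    = '|';
my @fields = split /\Q$sep\E/, 'a|b|c';
# ('a', 'b', 'c')

# A precompiled pattern object can be passed as PATTERN too.
my $re    = qr/\s*;\s*/;
my @parts = split $re, 'x ; y;z';
# ('x', 'y', 'z')

print "@fields\n";   # a b c
print "@parts\n";    # x y z
```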

Sincerely,
Michael Witten

p5pRT · 2011-05-24T19:06:17Z

From mfwitten@gmail.com

On Sun, 22 May 2011 13:34:36 -0700, Father Chrysostomos wrote:

+As a special case for C<split>, the empty pattern given in
+L<match operator|perlop/m_> syntax (C<//>) specifically matches

I don’t think that’s a valid pod link. It should be
L<...|perlop/"m/PATTERN/msixpodualgc"> or something like that.

Well, when I was first writing this patch, I was going to do exactly
that, but I found the following in `perlpod':

"L<name>" -- a hyperlink
There are various syntaxes, listed below. In the
syntaxes given, "text", "name", and "section"
cannot contain the characters '/' and '|'; and
any '<' or '>' should be matched.

Thus, the obvious choice (your suggestion) is incorrect. At the time, I
figured the proper workaround would be to use the escape formatting
code:

"E<escape>" -- a character escape
Very similar to HTML/XML "&foo;" "entity references":

...

· "E<sol>" = a literal / (solidus)

The above [is] optional except in other formatting codes,
notably "L<...>", and when preceded by a capital letter.

like this:

L<match operator|perlop/mE<sol>PATTERNE<sol>msixpodualgc>

which ends up producing an HTML hyperlink with the following `href':

/pod/perlop.html#me_sol_patterne_sol_msixpodualgc

Unfortunately, that doesn't actually link to anything because the
actual anchor in `perlop.html' is:

m_

which is what I ended up using directly.

In fact, if you use the following [disallowed] POD text:

L<match operator|perlop/m/anything you want to write>

then you'll get a link that works, because `pod2html' produces an HTML
hyperlink with the following `href':

m_

Consequently, I'm actually led to believe that `pod2html' is
producing bad anchors in such situations; in particular, the following
POD lines from `perlop.pod':

=item m/PATTERN/msixpodualgc
=item /PATTERN/msixpodualgc

produce the following corresponding HTML anchor values:

m_
pattern_msixpodualgc

How does that make sense?

They should probably produce whatever these corresponding POD lines
would produce:

L<match operator|perlop/mE<sol>PATTERNE<sol>msixpodualgc>
L<match operator|perlop/E<sol>PATTERNE<sol>msixpodualgc>

So... Where do we go from here?

Sincerely,
Michael Witten

p5pRT · 2011-05-24T19:44:52Z

From mfwitten@gmail.com

On Sun, 22 May 2011 13:34:36 -0700, Father Chrysostomos wrote:

+As another special case, C<split> emulates the default behavior of
+the command line tool B<awk> when the PATTERN is a I<string> composed
+of a single space character (S<C<' '>>): Any leading whitespace in

The split " " behaviour only applies to a literal string, not to a
variable. (But, due to optimisation, it *does* apply to split(" "."",
$stuff).)

+EXPR is removed before splitting occurs, and the PATTERN is instead
+treated as if it were C</\s+/>; in particular, this means that I<any>
+whitespace (not just a single space character) is used as a delimiter.
+However, this special treatment can be avoided by specifying the
+PATTERN using the match operator syntax (rather than a plain string),
+thereby allowing a single space character to be a delimiter: S<C</
+/>>.

s/thereby allowing/thereby only allowing/

Here's my attempt to incorporate those points:

As another special case, C<split> emulates the default behavior of the
command line tool B<awk> when the PATTERN is any expression that is
optimized into a I<literal string> composed of a single space character
(such as S<C<' '>> or S<C<' '.''>> or a similar compile-time constant).
In this case, any leading whitespace in EXPR is removed before splitting
occurs, and the PATTERN is instead treated as if it were C</\s+/>; in
particular, this means that I<any> whitespace (not just a single space
character) is used as a delimiter. However, this special treatment can be
avoided by specifying the PATTERN using the match operator syntax (rather
than a plain string), thereby allowing only a single space character to be
a delimiter: S<C</ />>.

If PATTERN is omitted, then PATTERN defaults to a literal string composed
of a single space character (S<C<' '>>), which invokes the aforementioned
B<awk> emulation.
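
To make the difference concrete, here is a small sketch contrasting the literal-string form with the match-operator form; the input string is illustrative:

```perl
use strict;
use warnings;

my $text = "  a  b c";   # two leading spaces, two spaces between a and b

# Literal single-space string: awk-style splitting.
my @awk = split ' ', $text;
# ('a', 'b', 'c')

# Match-operator syntax: each single space is its own separator,
# so leading and doubled spaces yield empty fields.
my @single = split / /, $text;
# ('', '', 'a', '', 'b', 'c')

# PATTERN (and EXPR) omitted: awk-style splitting of $_.
my @default;
for ($text) { @default = split; }

print scalar(@awk), " ", scalar(@single), " ", scalar(@default), "\n";   # 3 6 3
```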

Sincerely,
Michael Witten

p5pRT · 2011-05-24T19:55:00Z

From tchrist@perl.com

command line tool B<awk> when the PATTERN is any expression that is
optimized into a I<literal string> composed of a single space character
(such as S<C<' '>> or S<C<' '.''>> or a similar compile-time constant).

such as C<" "> or C<"\x20">

In this case, any leading whitespace in EXPR is removed before splitting
occurs, and the PATTERN is instead treated as if it were C</\s+/>; in

Leading whitespace in EXPR is ignored, and the PATTERN C</\s+/> is used
for splitting.

particular, this means that I<any> whitespace (not just a single space

any stretch of one or more whitespace characters

character) is used as a delimiter. However, this special treatment can be

!!!!!!!!!!!!!!s/delimiter/separator/!!!!!!!!!!!!!!!!!!!!

avoided by specifying the PATTERN using the match operator syntax (rather

avoid this by specifying the pattern C</ /> instead of the string C<" ">

than a plain string), thereby allowing only a single space character to be
a delimiter: S<C</ />>.

!!!!!!!!!!!!!!s/delimiter/separator/!!!!!!!!!!!!!!!!!!!!

If PATTERN is omitted, then PATTERN defaults to a literal string composed
of a single space character (S<C<' '>>), which invokes the aforementioned
B<awk> emulation.

If omitted, PATTERN defaults to a single space, C<" ">, triggering
the previously described I<awk> emulation.

--tom

p5pRT · 2011-05-24T19:59:11Z

From tchrist@perl.com

As another special case, C<split> emulates the default behavior of the
command line tool B<awk> when the PATTERN is any expression that is
optimized into a I<literal string> composed of a single space character

As another special case, when PATTERN is a single space character as a
string, not a pattern, C<split> emulates I<awk>'s default behavior

--tom

p5pRT · 2011-05-25T16:01:13Z

From mfwitten@gmail.com

Thanks for taking a look, Tom.

On Tue, 24 May 2011 13:54:10 -0600, Tom Christiansen <tchrist@perl.com> wrote:

command line tool B<awk> when the PATTERN is any expression that is
optimized into a I<literal string> composed of a single space character
(such as S<C<' '>> or S<C<' '.''>> or a similar compile-time constant).

such as C<" "> or C<"\x20">

I'll add that as an example.

In this case, any leading whitespace in EXPR is removed before splitting
occurs, and the PATTERN is instead treated as if it were C</\s+/>; in

Leading whitespace in EXPR is ignored, and the PATTERN C</\s+/> is used
for splitting.

YOU would actually probably want to say:

Any stretch of one or more leading white space characters

as per below :-)

In any case, I like my wording better because it is more explicit
regarding how to think about what's going on.

particular, this means that I<any> whitespace (not just a single space

any stretch of one or more whitespace characters

For one thing, I dislike mixing "one" with "more" because it leaves in limbo
the plurality of the following text; it would be better to write:

any stretch of at least one whitespace character

In any case, I already use `leading whitespace' with that same meaning, and
the parenthetical statement is meant to emphasize the unexpected behavior.
Thus, I'm compelled to stick with the text as written, but I'll add the
word `contiguous' (see below).

character) is used as a delimiter. However, this special treatment can be

!!!!!!!!!!!!!!s/delimiter/separator/!!!!!!!!!!!!!!!!!!!!

...

than a plain string), thereby allowing only a single space character to be
a delimiter: S<C</ />>.

!!!!!!!!!!!!!!s/delimiter/separator/!!!!!!!!!!!!!!!!!!!!

Actually, I tried to use `delimiter' very strictly throughout this text, and
this term is defined earlier as being the text in EXPR that matches PATTERN:

Anything in EXPR that matches PATTERN is taken to be a delimiter
that separates the EXPR into substrings (called "I<fields>") that
do B<not> include the delimiter. Note that a delimiter may be
longer than one character or even have no characters at all (the
empty string, which is a zero-width match).

Of course, that is actually a very loose definition, and I think I can do
better (for one, the kinds of input for PATTERN should be discussed first).

I'll make an improvement there and then send an email about it.

avoided by specifying the PATTERN using the match operator syntax (rather

avoid this by specifying the pattern C</ /> instead of the string C<" ">

How about incorporating that into the examples? See below.

If PATTERN is omitted, then PATTERN defaults to a literal string composed
of a single space character (S<C<' '>>), which invokes the aforementioned
B<awk> emulation.

If omitted, PATTERN defaults to a single space, C<" ">, triggering
the previously described I<awk> emulation.

How about incorporating that into the awk emulation introduction?

As another special case, C<split> emulates the default behavior of the
command line tool B<awk> when the PATTERN is either omitted or any
expression that is optimized into a I<literal string> composed of a single
space character (such as S<C<' '>> or S<C<"\x20">> or S<C<' '.''>> or a
similarly defined L<compile-time constant|core_perl::constant>, but not
e.g. S<C</ />>). In this case, any leading whitespace in EXPR is removed
before splitting occurs, and the PATTERN is instead treated as if it were
C</\s+/>; in particular, this means that I<any> contiguous whitespace (not
just a single space character) is used as a delimiter.

Is that POD link to `core_perl::constant' correct?

Sincerely,
Michael Witten

p5pRT · 2011-05-25T16:04:48Z

From tchrist@perl.com

Any stretch of one or more leading white space characters

as per below :-)

In any case, I like my wording better because it is more explicit
regarding how to think about what's going on.

particular, this means that I<any> whitespace (not just a single space

any stretch of one or more whitespace characters

For one thing, I dislike mixing "one" with "more" because it leaves in limbo
the plurality of the following text; it would be better to write:

any stretch of at least one whitespace character

In any case, I already use `leading whitespace' with that same meaning, and
the parenthetical statement is meant to emphasize the unexpected behavior.
Thus, I'm compelled to stick with the text as written, but I'll add the
word `contiguous' (see below).

character) is used as a delimiter. However, this special treatment can be

!!!!!!!!!!!!!!s/delimiter/separator/!!!!!!!!!!!!!!!!!!!!

...

than a plain string), thereby allowing only a single space character to be
a delimiter: S<C</ />>.

!!!!!!!!!!!!!!s/delimiter/separator/!!!!!!!!!!!!!!!!!!!!

Actually, I tried to use `delimiter' very strictly throughout this text, and
this term is defined earlier as being the text in EXPR that matches PATTERN:

Anything in EXPR that matches PATTERN is taken to be a delimiter
that separates the EXPR into substrings (called "I<fields>") that
do B<not> include the delimiter. Note that a delimiter may be
longer than one character or even have no characters at all (the
empty string, which is a zero-width match).

But that's wrong. A delimiter is a surrounder, not a separator.
Strings are quote-delimited. Literal lists are comma-separated.
These are not at all like each other.

I feel much more strongly about this than I have here expressed,
and I hope I won't need to do so.

--tom

p5pRT · 2011-05-25T16:16:34Z

From mfwitten@gmail.com

On Wed, May 25, 2011 at 16:03, Tom Christiansen <tchrist@perl.com> wrote:

character) is used as a delimiter. However, this special treatment can be

!!!!!!!!!!!!!!s/delimiter/separator/!!!!!!!!!!!!!!!!!!!!

...

than a plain string), thereby allowing only a single space character to be
a delimiter: S<C</ />>.

!!!!!!!!!!!!!!s/delimiter/separator/!!!!!!!!!!!!!!!!!!!!

Actually, I tried to use `delimiter' very strictly throughout this text, and
this term is defined earlier as being the text in EXPR that matches PATTERN:

Anything in EXPR that matches PATTERN is taken to be a delimiter
that separates the EXPR into substrings (called "I<fields>") that
do B<not> include the delimiter. Note that a delimiter may be
longer than one character or even have no characters at all (the
empty string, which is a zero-width match).

But that's wrong. A delimiter is a surrounder, not a separator.
Strings are quote-delimited. Literal lists are comma-separated.
These are not at all like each other.

Perhaps so. I can buy that; I'll use `separator' then.

I feel much more strongly about this than I have here expressed,
and I hope I won't need to do so.

What on earth do you mean? Are you holding back, good man?

p5pRT · 2011-05-25T16:31:18Z

From tchrist@perl.com

On Wed, May 25, 2011 at 16:03, Tom Christiansen <tchrist@perl.com> wrote:

character) is used as a delimiter. However, this special treatment can be

!!!!!!!!!!!!!!s/delimiter/separator/!!!!!!!!!!!!!!!!!!!!

...

than a plain string), thereby allowing only a single space character to be
a delimiter: S<C</ />>.

!!!!!!!!!!!!!!s/delimiter/separator/!!!!!!!!!!!!!!!!!!!!

Actually, I tried to use `delimiter' very strictly throughout this text, and
this term is defined earlier as being the text in EXPR that matches PATTERN:

Anything in EXPR that matches PATTERN is taken to be a delimiter
that separates the EXPR into substrings (called "I<fields>") that
do B<not> include the delimiter. Note that a delimiter may be
longer than one character or even have no characters at all (the
empty string, which is a zero-width match).

But that's wrong. A delimiter is a surrounder, not a separator.
Strings are quote-delimited. Literal lists are comma-separated.
These are not at all like each other.

Perhaps so. I can buy that; I'll use `separator' then.

Good. Thanks. Check perlglossary:

alternatives
A list of possible choices from which you may select
only one, as in "Would you like door A, B, or C?"
Alternatives in regular expressions are separated with
a single vertical bar: "|". Alternatives in normal
Perl expressions are separated with a double vertical
bar: "||". Logical alternatives in "Boolean"
expressions are separated with either "||" or "or".

BLOCK
A syntactic construct consisting of a sequence of Perl
statements that is delimited by braces. The "if" and
"while" statements are defined in terms of BLOCKs, for
instance. Sometimes we also say "block" to mean a
lexical scope; that is, a sequence of statements that
act like a "BLOCK", such as within an eval or a file,
even though the statements aren't delimited by braces.

delimiter
A "character" or "string" that sets bounds to an
arbitrarily-sized textual object, not to be confused
with a "separator" or "terminator". "To delimit"
really just means "to surround" or "to enclose" (like
these parentheses are doing).

field
A single piece of numeric or string data that is part
of a longer "string", "record", or "line". Variable-
width fields are usually split up by separators (so
use split to extract the fields), while fixed-width
fields are usually at fixed positions (so use unpack).
Instance variables are also known as fields.

LIST
A syntactic construct representing a comma-separated
list of expressions, evaluated to produce a "list
value". Each "expression" in a "LIST" is evaluated in
"list context" and interpolated into the list value.

separator
A "character" or "string" that keeps two surrounding
strings from being confused with each other. The
split function works on separators. Not to be
confused with delimiters or terminators. The "or" in
the previous sentence separated the two alternatives.

terminator
A "character" or "string" that marks the end of
another string. The $/ variable contains the string
that terminates a readline operation, which chomp
deletes from the end. Not to be confused with
delimiters or separators. The period at the end of
this sentence is a terminator.

I feel much more strongly about this than I have here expressed,
and I hope I won't need to do so.

What on earth do you mean? Are you holding back, good man?

Why certainly, but it's better that way.

But as you insist, here is a taste.

Consider the string:

:foo:bar:

You get a different number of fields if that string is considered
to be colon-separated, -terminated, or -delimited -- according
to the standard definitions I have provided above.
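
As a rough sketch of two of those readings (the delimited case is left out here, since how it would be counted is not spelled out above), assuming the terminated reading means "each field runs up to the next colon":

```perl
use strict;
use warnings;

my $s = ':foo:bar:';

# Colon-separated: a negative LIMIT keeps the trailing empty field.
my @separated = split /:/, $s, -1;
# ('', 'foo', 'bar', '')

# Colon-terminated: each field is the text up to the next colon.
my @terminated = $s =~ /([^:]*):/g;
# ('', 'foo', 'bar')

print scalar(@separated), " ", scalar(@terminated), "\n";   # 4 3
```

Note that a plain C<split /:/, $s> with no LIMIT drops the trailing empty field, so it would also report three fields.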

"Separator", "terminator", and "delimiter" are three distinct terms for
three orthogonal concepts; add "to separate", "to terminate", and "to
delimit" as the corresponding verbs that go with those nouns.

If you collapse the distinction to make any two of those terms map to
the same concept, what word would you then provide in its stead, eh?
And you would have to do that, because we need to be able to unambiguously
identify which of two distinct concepts we are referring to.

I feel that it is important in technical works to preserve meaningful
technical distinctions, not to erase them with overloaded ambiguities that
then require further elaboration and explanation each time they are used,
and perhaps even the creation of brand new terms in place of the
now-watered-down old ones.

That would just make more work for us all, not less.

--tom

p5pRT · 2011-05-25T16:44:41Z

From mfwitten@gmail.com

On Wed, May 25, 2011 at 16:30, Tom Christiansen <tchrist@perl.com> wrote:

On Wed, May 25, 2011 at 16:03, Tom Christiansen <tchrist@perl.com> wrote:

Actually, I tried to use `delimiter' very strictly throughout this text
and this term is defined earlier as being the text in EXPR that matches
PATTERN:

Anything in EXPR that matches PATTERN is taken to be a delimiter
that separates the EXPR into substrings (called "I<fields>") that
do B<not> include the delimiter. Note that a delimiter may be
longer than one character or even have no characters at all (the
empty string, which is a zero-width match).

But that's wrong. A delimiter is a surrounder, not a separator.
Strings are quote-delimited. Literal lists are comma-separated.
These are not at all like each other.

Perhaps so. I can buy that; I'll use `separator' then.

Good. Thanks. Check perlglossary:

Awesome. Perhaps all of the documentation should link to the glossary
more liberally?

I feel much more strongly about this than I have here expressed,
and I hope I won't need to do so.

What on earth do you mean? Are you holding back, good man?

Why certainly, but it's better that way.

But as you insist, here is a taste.

[a taste]

Outstanding!

I, for one, appreciate your robust expostulation.

p5pRT · 2012-01-06T21:20:52Z

From @cpansprout

On Wed May 25 09:44:41 2011, mfwitten wrote:

On Wed, May 25, 2011 at 16:30, Tom Christiansen <tchrist@perl.com> wrote:

On Wed, May 25, 2011 at 16:03, Tom Christiansen <tchrist@perl.com> wrote:

Actually, I tried to use `delimiter' very strictly throughout this text
and this term is defined earlier as being the text in EXPR that matches
PATTERN:

Anything in EXPR that matches PATTERN is taken to be a delimiter
that separates the EXPR into substrings (called "I<fields>") that
do B<not> include the delimiter. Note that a delimiter may be
longer than one character or even have no characters at all (the
empty string, which is a zero-width match).

But that's wrong. A delimiter is a surrounder, not a separator.
Strings are quote-delimited. Literal lists are comma-separated.
These are not at all like each other.

Perhaps so. I can buy that; I'll use `separator' then.

Good. Thanks. Check perlglossary:

Awesome. Perhaps all of the documentation should link to the glossary
more liberally?

I feel much more strongly about this than I have here expressed,
and I hope I won't need to do so.

What on earth do you mean? Are you holding back, good man?

Why certainly, but it's better that way.

But as you insist, here is a taste.

[a taste]

Outstanding!

I, for one, appreciate your robust expostulation.

I’ve taken your patch and incorporated suggestions from others as best I
can. I also omitted the part about things that optimise to a single
space, because I consider that a bug.

I have applied the result as bd46758. Thank you.

--

Father Chrysostomos

p5pRT · 2012-01-06T21:20:52Z

@cpansprout - Status changed from 'open' to 'resolved'

p5pRT closed this as completed Jan 6, 2012

p5pRT added Severity Low documentation hasPatch labels Oct 19, 2019
