Report information
Id: 122761
Status: resolved
Priority: 0/
Queue: perl5

Owner: Nobody
Requestors: mauke- <l.mai [at] web.de>
Cc:
AdminCc:

Operating System: (no value)
PatchStatus: (no value)
Severity: low
Type: unknown
Perl Version: (no value)
Fixed In: (no value)



Subject: split /\A/ works like /^/m, matches embedded newlines
perldoc perlrebackslash:

   \A  "\A" only matches at the beginning of the string.

perldoc -f split:

   Empty leading fields are produced when there are positive-width matches at the beginning of the string; a zero-width match at the beginning of the string does not produce an empty field.

Therefore split /\A/ should return the input string as is. \A can only match once (at offset 0), which (logically speaking) should turn "foo" into ("", "foo"), but because of the special case in split of not producing empty leading fields for zero-width matches at the beginning, we just get "foo" again.

What actually happens:

$ perl -wE 'say "[$_]" for split /\A/, "foo\nbar\nbaz"'
[foo
]
[bar
]
[baz]

Apparently split thinks /\A/ is the same as /^/m, matching after every embedded newline in the input string. I think this is a bug in split.

The test above was with:
This is perl 5, version 12, subversion 4 (v5.12.4) built for x86_64-linux

... but an IRC bot running 5.20.0 produces the same results so I assume it's still present in 5.20.
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: "bugs-bitbucket [...] rt.perl.org" <bugs-bitbucket [...] rt.perl.org>
Date: Thu, 11 Sep 2014 16:29:02 +0200
To: Perl5 Porteros <perl5-porters [...] perl.org>
From: demerphq <demerphq [...] gmail.com>
Yes this is still in blead.

I was party to breaking this in 7bd1e61447493a93405e0d15fe2f8a0b6bf71de1 in 2007. (7 years to find the bug in the logic!) But the story is, as usual, much more complicated than that.

This code does NOT use the regex engine for anything other than parsing the pattern. The pattern produces an SBOL END regop sequence, which is the same as would be produced for /^/, and triggers the RXf_START_ONLY optimisation case for split.

Part of the problem is that way way way back in the history of Perl, someone decided that split /^/, $string should behave the same as split /^/m, $string.

To explain more /^/m produces a MBOL op, "multi-beginning-of-line", and /^/ produces a SBOL op, "single-beginning-of-line".

And split will and has always treated both the same, as an MBOL, when the pattern was JUST /^/.

Later on in history /\A/ was added as a synonym for /^/, and produces an SBOL op.

When I upgraded the logic in 7bd1e61447 to look at the regop structure instead of the pattern *string* (a much more reliable process), and to set flags in the pattern which split would then use (something required to enable regex engine plug-ins to trigger split optimisations), I inadvertently made it so split /\A/ was the same as split /^/, which was always the same thing as split /^/m.

So now we have a problem. There is LOADS of code out there that assumes that

split /^/, $string;

is the correct way to split a string into lines.

However, it was only true because the optimisation in split // did not pay attention to the presence or absence of the /m flag.

So we are now in a jam.

I can do some kind of workaround that makes /\A/ not trigger this optimisation, but basically split is broken by design for these kinds of cases.

The naive obvious fix would be to document that split // operates with the m flag set by default, which would explain the unusual behavior of split /^/, but that would break other patterns.

For instance split /^x/ does not act as though there is an implicit /m flag set.

perl -le'my $str="foo\nxbar\nxbaz\n"; print ">>$_<<" for split /^x/, $str'
>>foo
xbar
xbaz
<<
perl -le'my $str="foo\nxbar\nxbaz\n"; print ">>$_<<" for split /^x/m, $str'
>>foo
<<
>>bar
<<
>>baz
<<

Compare with just plain /^/:

perl -le'my $str="foo\nxbar\nxbaz\n"; print ">>$_<<" for split /^/m, $str'
>>foo
<<
>>xbar
<<
>>xbaz
<<
perl -le'my $str="foo\nxbar\nxbaz\n"; print ">>$_<<" for split /^/, $str'
>>foo
<<
>>xbar
<<
>>xbaz
<<

IOW, split /^/ and split /^/m do the same thing, which they definitely shouldn't.
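
For anyone who wants to reproduce the comparison above in one go, here is a minimal sketch. The variable names are only illustrative, and the field counts it prints are the ones implied by the outputs quoted above.

```perl
use strict;
use warnings;

my $str = "foo\nxbar\nxbaz\n";

# A bare /^/ is special-cased by split and behaves like /^/m.
my @bare   = split /^/,  $str;
my @bare_m = split /^/m, $str;

# With anything after the ^, the special case does not apply,
# so /^x/ can only match at the very start of the string.
my @caret_x   = split /^x/,  $str;
my @caret_x_m = split /^x/m, $str;

printf "%-6s %d field(s)\n", @$_ for
    [ '/^/',   scalar @bare      ],
    [ '/^/m',  scalar @bare_m    ],
    [ '/^x/',  scalar @caret_x   ],
    [ '/^x/m', scalar @caret_x_m ];
```

Only the /^x/ case should report a single field, which is the inconsistency being discussed.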

Given all this I really can't decide what to do. I *could* change the code so that SBOL is exempt from this optimisation (or perhaps triggers a different optimisation), but that would break split /^/. On one level I wouldn't mind, as I could argue it is broken already, but in practice I think this would break really a lot of code, and at the very least we would probably want a deprecation cycle. I *could* mess around and figure out a way to distinguish /^/ from /\A/ even though the two should be identical, or I could just say "yeah, split doesn't play by the rules, won't-fix".

A simple workaround, btw, would be to write: /\A|\A/. But that would suck.

I really don't know what to do here. Basically the root of this bug was probably created in the very early history of Perl.

Another alternative would be to introduce a multiline version of \A, say \L for this discussion, and then fix split /\A/ and split /\L/ to do the right thing, and leave split /^/ broken (and document it is broken).

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
CC: Perl5 Porteros <perl5-porters [...] perl.org>, "bugs-bitbucket [...] rt.perl.org" <bugs-bitbucket [...] rt.perl.org>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
From: Abigail <abigail [...] abigail.be>
Date: Thu, 11 Sep 2014 18:14:07 +0200
To: demerphq <demerphq [...] gmail.com>
On Thu, Sep 11, 2014 at 04:29:02PM +0200, demerphq wrote:
>
> [SNIP]
>
> So now we have a problem. There is LOADS of code out there that assumes that
>
> split /^/, $string;
>
> is the correct way to split a string into lines.
>
> However it was only true because of the optimisation in split // did not
> pay attention to the presence or absence of the /m flag.

I first thought having "split /^/" mean the same as "split /^/m" was done intentionally, as it's documented by "perldoc -f split":

    A PATTERN of "/^/" is treated as if it were "/^/m", since it isn’t much use otherwise.

The third edition of "Programming Perl" documents this behaviour as well -- but the second edition does not.

But looking at some old commits, this may actually not be the case. Commit 2cdd06f700e22243a0f92357f562eb4b13b7197a (Aug 4, 1999/Ilya Zakharevich) makes perl warn on "split /^/", saying its usage to mean "split /^/m" is deprecated. Commit 46a8fef3c29368ecca588145570275c3556984b0 (Aug 5, 1999/Paul Marquess) turns a "warn" into a "Perl_warner" (with the same message). Then half an hour later, in commit 0e8f60dd43c9e8276bfd6598ee62ebf70fa0c631 (Aug 5, 1999/Jarkko Hietaniemi) the deprecation warning is no longer on by default.

And then commit 1ec94568ca12249aa865999d7dec5eb1fa3123c7 (Jul 25, 2000/M. J. T. Guy) documents the current behaviour. The patch says "with notes from tchrist and gbarr", and it was the summer of 2000 that people were working on the third edition of "Programming Perl".

I haven't checked the mail archives to see whether there was any discussion.

>
> So we are now in a jam.
>
> I can do some kind of workaround that makes /\A/ not trigger this
> optimisation, but basically split is broken by design for these kind of
> cases.
>
> The naive obvious fix would be to document that split // operates with the
> m flag set by default, which would explain the unusual behavior of split
> /^/, but that would break other patterns.

As I said, /^/ implying a /m has already been documented for 14 years.

> [SNIP]
>
> Another alternative would be to introduce a multiline version of \A, say \L
> for this discussion, and then fix split /\A/ and split /\L/ to do the right
> thing, and leave split /^/ broken (and document it is broken).

My suggestion: leave it as is, and document it. How useful is it to be able to write:

    split /\A/ => $foo;

when you could have written

    $foo;

instead? Fixing it to do the "right thing" seems like a whole lot of work for little benefit.

Abigail
RT-Send-CC: perl5-porters [...] perl.org
On Thu Sep 11 2014, 07:29:35, demerphq wrote:
>
> I really dont know what to do here. Basically the root of this bug was
> created probably in the very early history of Perl.

My first idea would be to revert the opcode checking and go back to the pattern source; i.e. do the equivalent of $src eq "^". That would keep backwards compatibility with existing code and the letter of the documentation ("If PATTERN is /^/, ..."). It would also make \A "work" (i.e. do nothing) again.

Then I'd add a deprecation note to the documentation; something like:

    If PATTERN is /^/, then it is treated as if it used the multiline
    modifier (/^/m). However, this special case is deprecated. Always
    use /^/m in new code.

That leaves the door open to actual deprecation warnings if we decide to remove this feature in a future release.
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: Perl5 Porteros <perl5-porters [...] perl.org>
To: Perl RT Bug Tracker <perlbug-followup [...] perl.org>
Date: Thu, 11 Sep 2014 18:36:52 +0200
From: demerphq <demerphq [...] gmail.com>
On 11 September 2014 18:27, l.mai@web.de via RT <perlbug-followup@perl.org> wrote:
> On Thu Sep 11 2014, 07:29:35, demerphq wrote:
> >
> > I really dont know what to do here. Basically the root of this bug was
> > created probably in the very early history of Perl.
>
> My first idea would be to revert the opcode checking and go back to the pattern source; i.e. do the equivalent of $src eq "^". That would keep backwards compatibility with existing code and the letter of the documentation ("If PATTERN is /^/, ..."). It would also make \A "work" (i.e. do nothing) again.

FWIW, I am really against using the raw pattern.

For instance I expect:

split /(?:)^/

to be the same as

split /^(?:)/

to be the same as

split /^/

to be the same as

split /#this splits lines out without capturing the line break
 ^
 #end of comment
/x

I fixed a bunch of issues like this when I redid this code. I am really against changing back.
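
To make the objection concrete, here is a minimal sketch of what a check on the raw pattern text would see. The function name is made up for illustration; it is not anything in perl's source, only a stand-in for the "$src eq '^'" idea.

```perl
use strict;
use warnings;

# Hypothetical check in the spirit of "$src eq '^'": it looks only at the
# pattern text, not at what the pattern compiles to.
sub is_bare_caret_by_source {
    my ($src) = @_;
    return $src eq '^' ? 1 : 0;
}

# Four spellings of what is logically the same pattern
# (the last one is what an /x pattern with a comment looks like as text).
my @sources = ('^', '(?:)^', '^(?:)', "# comment\n ^\n");

for my $src (@sources) {
    (my $shown = $src) =~ s/\n/\\n/g;    # make newlines visible when printing
    printf "%-20s %s\n", $shown,
        is_bare_caret_by_source($src) ? 'special-cased' : 'not special-cased';
}
```

Only the first spelling passes a text comparison, which is the kind of difference the bug reports mentioned above were about.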

Yves 
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: Perl5 Porteros <perl5-porters [...] perl.org>, "bugs-bitbucket [...] rt.perl.org" <bugs-bitbucket [...] rt.perl.org>
Date: Thu, 11 Sep 2014 12:59:08 -0400
To: demerphq <demerphq [...] gmail.com>
From: Eric Brine <ikegami [...] adaelis.com>
split already says:

If PATTERN is /^/, then it is treated as if it used the multiline modifier (/^/m), since it isn't much use otherwise.

How about

If PATTERN is /^/ or /\A/, then it is treated as if it used the multiline modifier (/^/m), since it isn't much use otherwise. Prepending (?:) (e.g. /(?:)^/) sufficiently alters the pattern to restore the normal regex behaviour.








To: "l.mai [...] web.de via RT" <perlbug-followup [...] perl.org>
Date: Thu, 11 Sep 2014 19:36:38 +0200
From: Abigail <abigail [...] abigail.be>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: "OtherRecipients of perl Ticket #122761": ;, perl5-porters [...] perl.org
On Thu, Sep 11, 2014 at 09:27:36AM -0700, l.mai@web.de via RT wrote:
> On Thu Sep 11 2014, 07:29:35, demerphq wrote:
> >
> > I really dont know what to do here. Basically the root of this bug was
> > created probably in the very early history of Perl.
>
> My first idea would be to revert the opcode checking and go back to the pattern source; i.e. do the equivalent of $src eq "^". That would keep backwards compatibility with existing code and the letter of the documentation ("If PATTERN is /^/, ..."). It would also make \A "work" (i.e. do nothing) again.
>
> Then I'd add a deprecation note to the documentation; something like:
>
>     If PATTERN is /^/, then it is treated as if it used the multiline
>     modifier (/^/m). However, this special case is deprecated. Always
>     use /^/m in new code.
>
> That leaves the door open to actual deprecation warnings if we decide to remove this feature in a future release.
>

As can be seen in my other post, we did this back in 1999. Then quickly turned off the warning by default. And then a year later, just documented the behaviour.

It has been documented to work like this for 14 years now, more than half the lifetime of Perl. Considering the uselessness of splitting on just the beginning of the string (which is effectively a noop), I do not think there's anything significant to be gained by deprecating this.

Abigail
Date: Thu, 11 Sep 2014 19:36:42 +0200
To: Eric Brine <ikegami [...] adaelis.com>
From: demerphq <demerphq [...] gmail.com>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: Perl5 Porteros <perl5-porters [...] perl.org>, "bugs-bitbucket [...] rt.perl.org" <bugs-bitbucket [...] rt.perl.org>
On 11 September 2014 18:59, Eric Brine <ikegami@adaelis.com> wrote:
> split already says:
>
> If PATTERN is /^/, then it is treated as if it used the multiline modifier (/^/m), since it isn't much use otherwise.
>
> How about
>
> If PATTERN is /^/ or /\A/, then it is treated as if it used the multiline modifier (/^/m), since it isn't much use otherwise. Prepending (?:) (e.g. /(?:)^/) sufficiently alters the pattern to restore the normal regex behaviour.


I really really really hate the idea that prepending (?:) to the pattern should change what it does. It is completely counterintuitive. Like 

foo((),$thing);

being different from

foo($thing);

And like I said we had a bunch of bug reports along those lines.

The real issue here is that the /^/ implies /m in split thing was not properly thought out and should never have been done. Why should 
C<split /^/, $string> be different from C<split /^x/, $string>? 

This is yet another example of how "ooh neat" features, especially in the regex engine almost *always* cause trouble that is nearly impossible to resolve after the fact.

Yves
 
From: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
To: perl5-porters [...] perl.org
Date: Thu, 11 Sep 2014 13:54:57 -0400
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
* demerphq <demerphq@gmail.com> [2014-09-11T10:29:02]
> I *could* mess around and figure out a way to distinguish /^/ from /\A/ even
> though the two should be identical, or I could just say "yeah, split doesnt
> play by the rules, wont-fix".

First off: thanks for this post, which was interesting and useful.

It seems to me that the above is a subset of the below:

> Another alternative would be to introduce a multiline version of \A, say \L
> for this discussion, and then fix split /\A/ and split /\L/ to do the right
> thing, and leave split /^/ broken (and document it is broken).

That is: you need to distinguish ^ from \A, whether or not you add \L, for such
a fix.  Before I get into anything else: is that an accurate reading of the
situation?

--
rjbs

Date: Thu, 11 Sep 2014 20:20:28 +0200
To: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
From: demerphq <demerphq [...] gmail.com>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: Perl5 Porteros <perl5-porters [...] perl.org>
On 11 September 2014 19:54, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
> * demerphq <demerphq@gmail.com> [2014-09-11T10:29:02]
> > I *could* mess around and figure out a way to distinguish /^/ from /\A/ even
> > though the two should be identical, or I could just say "yeah, split doesnt
> > play by the rules, wont-fix".
>
> First off: thanks for this post, which was interesting and useful.

No problem. Especially as I was indirectly responsible for part of the mess.

> It seems to me that the above is a subset of the below:
>
> > Another alternative would be to introduce a multiline version of \A, say \L
> > for this discussion, and then fix split /\A/ and split /\L/ to do the right
> > thing, and leave split /^/ broken (and document it is broken).
>
> That is: you need to distinguish ^ from \A, whether or not you add \L, for such
> a fix.  Before I get into anything else: is that an accurate reading of the
> situation?

Er, sort of. What you describe is option 2 below.

Thinking about this more I think there are two reasonable options:

1. document that all patterns to split are compiled under /m by default. At the same time we would change  the optimisation for /^/ to detect MBOL, which is what is produced by split /^/m, $string. This would then mean that only /^/ would trigger the optimisation as it would produce a MBOL, and \A would be fine because it produces an SBOL.

2. use the flag field of the regop to store whether the SBOL comes from \A or ^, and then only enable the /^/ optimisation when it was /^/.

Personally, the more I think about this, the more I think that 1 is better, even though it is probably the riskier of the two. Having said that, I am speaking hypothetically: option 1 *might* break something, but I struggle to think what, whereas option 2 would leave lots of things "broken" that are already "broken" and would fix this single case only, without breaking anything else.

Consider what option 1 would result in:

  split /^/, $string
  split /^x/, $string

would behave the same as far as the ^ operator goes. And it would mean that

split /^/, $string
split /$/, $string
split /\Z/, $string

would behave similarly (that is, match at all the beginnings or ends of lines in the string). That they don't is IMO pretty wrong. The justification applied to make split /^/ work IMO applies just as much to split /$/ or split /\z/.

And it would fix the problem with /\A/ behaving like /^/m (which is uncontroversially wrong).

And when I think about what it would break I struggle to think of something. Does anything come to mind to anyone else? 

Also, the other nice thing about option one is that it doesn't need an \L metacharacter, which I proposed only because we would have no way to say "I really want to split on the beginning of the string". Whereas if we defaulted split compilation to enable /m then we could turn it off easily:

split /(?-m:^)/, $string

would disable the default /m flag. The reason I proposed the \L metacharacter is that there is no way to turn off a flag from outside the pattern.
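
The scoped modifier itself can be tried today in an ordinary match, independently of split. The sketch below only shows how (?-m:...) behaves inside a pattern compiled with /m; whether split would compile its patterns under /m by default is the proposal under discussion, not something this code demonstrates.

```perl
use strict;
use warnings;

my $string = "foo\nbar\nbaz";

# Under /m, ^ matches at the start of every line.
my @every_line = $string =~ /^(\w+)/mg;

# (?-m:...) switches /m off for the enclosed part only, so this ^ matches
# at the start of the string and nowhere else.
my @string_start = $string =~ /(?-m:^)(\w+)/mg;

print "with /m:     @every_line\n";
print "with (?-m:): @string_start\n";
```

The first list holds one word per line of the input, the second only the word at the start of the string.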

In fact, in the process of writing this email I have become sufficiently convinced that option 1 (compiling under /m by default) is the right thing to do that I will start writing the patch now, so we can find out if it breaks anything.

Yves

From: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
Date: Thu, 11 Sep 2014 22:45:16 +0200
To: perl5-porters [...] perl.org
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
* demerphq <demerphq@gmail.com> [2014-09-11 20:25]:
> Whereas if we defaulted split compilation to enable /m then we could
> turn it off easily:
>
> split /(?-m:^)/, $string
>
> would disable the defaut /m flag.

Would writing it `split qr/^/, $string` also work? (I would hope yes.)

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: Perl5 Porteros <perl5-porters [...] perl.org>
Date: Thu, 11 Sep 2014 22:54:33 +0200
To: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
From: demerphq <demerphq [...] gmail.com>
On 11 September 2014 22:45, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
> * demerphq <demerphq@gmail.com> [2014-09-11 20:25]:
> > Whereas if we defaulted split compilation to enable /m then we could
> > turn it off easily:
> >
> > split /(?-m:^)/, $string
> >
> > would disable the defaut /m flag.
>
> Would writing it `split qr/^/, $string` also work? (I would hope yes.)

No, when Karl changed qr/^/ to reduce down to (?^:^) he changed the semantics of such a case so we would not disable the /m flag, and this would behave just the same as split /^/, $_.  I think anyway.

IOW, (?^:^) means "match /^/ under whatever rules the pattern is compiled in".

In the older perls it would turn into (?-msix:^) and then yes I think it would have behaved as you expect.
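
A quick way to check which form a given perl produces is to print the stringified qr object. This is only a sketch: the exact text depends on the perl version and on any flags or features in effect, so no output is shown here.

```perl
use strict;
use warnings;

# Stringify a compiled pattern to see how it would be embedded in another one.
my $re   = qr/^/;
my $text = "$re";

# Newer perls use the (?^:...) form; older ones spell the flags out explicitly.
my $style = $text =~ /\A\(\?\^/ ? 'caret form' : 'explicit-flags form';

print "qr/^/ stringifies as $text ($style)\n";
```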

Win-some, lose-some.

cheers,
Yves
From: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
Date: Fri, 12 Sep 2014 00:19:20 +0200
To: perl5-porters [...] perl.org
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
* demerphq <demerphq@gmail.com> [2014-09-11 22:55]:
> No, when Karl changed qr/^/ to reduce down to (?^:^) he changed the
> semantics of such a case so we would not disable the /m flag, and this
> would behave just the same as split /^/, $_.

It does.

I am assuming that this special compilation context applies at the time that split compiles the pattern, which implies that if split were made to not stringify qr objects, then it would not apply to them. Correct?

If so – is that doable with reasonable effort? It would go some ways toward regularising split’s behaviour further.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
To: demerphq <demerphq [...] gmail.com>
Date: Fri, 12 Sep 2014 00:58:29 +0200
From: Abigail <abigail [...] abigail.be>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Perl5 Porteros <perl5-porters [...] perl.org>
On Thu, Sep 11, 2014 at 08:20:28PM +0200, demerphq wrote:
>
> Thinking about this more I think there are two reasonable options:
>
> 1. document that all patterns to split are compiled under /m by default.

To do that, you would first have to change the behavior of split, as
it currently does *NOT* do this. Only for /^/. Witness:

    $ perl -E 'say "[$_]" for split /^a/m => "foo\nabar\nabaz"'
    [foo
    ]
    [bar
    ]
    [baz]

    $ perl -E 'say "[$_]" for split /^a/ => "foo\nabar\nabaz"'
    [foo
    abar
    abaz]

Abigail
To: Abigail <abigail [...] abigail.be>
Date: Fri, 12 Sep 2014 01:09:27 +0200
From: demerphq <demerphq [...] gmail.com>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Perl5 Porteros <perl5-porters [...] perl.org>
On 12 September 2014 00:58, Abigail <abigail@abigail.be> wrote:
> On Thu, Sep 11, 2014 at 08:20:28PM +0200, demerphq wrote:
> >
> > Thinking about this more I think there are two reasonable options:
> >
> > 1. document that all patterns to split are compiled under /m by default.
>
> To do that, you would first have to change the behavior of split, as
> it currently does *NOT* do this. Only for /^/. Witness:
>
>     $ perl -E 'say "[$_]" for split /^a/m => "foo\nabar\nabaz"'
>     [foo
>     ]
>     [bar
>     ]
>     [baz]
>
>     $ perl -E 'say "[$_]" for split /^a/ => "foo\nabar\nabaz"'
>     [foo
>     abar
>     abaz]

Yes, I have said exactly the same thing multiple times in this thread.

And to me it's actually exactly the reason we *should* do this. I consider the inconsistency here to be *most* undesirable.

As I said elsewhere in this thread, why should split /$/ not have the same "special" rule applied? I find the extreme differences in the following  to be *most* surprising.

$ perl -le'my $str="foo\nxbar\nxbaz\n"; print ">>$_<<" for split /^/, $str'
>>foo
<<
>>xbar
<<
>>xbaz
<<
$ perl -le'my $str="foo\nxbar\nxbaz\n"; print ">>$_<<" for split /$/, $str'
>>foo
xbar
xbaz<<
>>
<<

To: demerphq <demerphq [...] gmail.com>
Date: Fri, 12 Sep 2014 01:47:17 +0200
From: Abigail <abigail [...] abigail.be>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Perl5 Porteros <perl5-porters [...] perl.org>
On Fri, Sep 12, 2014 at 01:09:27AM +0200, demerphq wrote:
> On 12 September 2014 00:58, Abigail <abigail@abigail.be> wrote:
>
> > On Thu, Sep 11, 2014 at 08:20:28PM +0200, demerphq wrote:
> > >
> > > Thinking about this more I think there are two reasonable options:
> > >
> > > 1. document that all patterns to split are compiled under /m by default.
> >
> > To do that, you would first have to change the behavior of split, as
> > it currently does *NOT* do this. Only for /^/. Witness:
> >
> >     $ perl -E 'say "[$_]" for split /^a/m => "foo\nabar\nabaz"'
> >     [foo
> >     ]
> >     [bar
> >     ]
> >     [baz]
> >
> >     $ perl -E 'say "[$_]" for split /^a/ => "foo\nabar\nabaz"'
> >     [foo
> >     abar
> >     abaz]
> >
>
> Yes, I have said exactly the same thing multiple times in this thread.
>
> And to me its actually exactly the reason we *should* do this. I consider
> the inconsistency here to be *most* undesirable.
>
> As I said elsewhere in this thread, why should split /$/ not have the same
> "special" rule applied? I find the extreme differences in the following to
> be *most* surprising.

Because no one uses /$/m to split a multiline string into individual lines, as that leaves you with strings starting with a newline. Giving /$/ a special rule just means an extra testcase, and another thing to use in obfuscated code, but it won't be useful for most people (it will also be harmless).

I'm still figuring out what problem needs solving. Is it really a problem that

    split /\A/, "multiline string";

splits as /^/m? Is splitting on the beginning of the string, resulting in a one-element list consisting of the string itself, so useful we want to overhaul how split is working? Can't we just document this exception?

Abigail
Date: Fri, 12 Sep 2014 05:19:49 +0200
To: Abigail <abigail [...] abigail.be>
From: demerphq <demerphq [...] gmail.com>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Perl5 Porteros <perl5-porters [...] perl.org>
On 12 September 2014 01:47, Abigail <abigail@abigail.be> wrote:
> On Fri, Sep 12, 2014 at 01:09:27AM +0200, demerphq wrote:
> > On 12 September 2014 00:58, Abigail <abigail@abigail.be> wrote:
> >
> > > On Thu, Sep 11, 2014 at 08:20:28PM +0200, demerphq wrote:
> > > >
> > > > Thinking about this more I think there are two reasonable options:
> > > >
> > > > 1. document that all patterns to split are compiled under /m by default.
> > >
> > > To do that, you would first have to change the behavior of split, as
> > > it currently does *NOT* do this. Only for /^/. Witness:
> > >
> > >     $ perl -E 'say "[$_]" for split /^a/m => "foo\nabar\nabaz"'
> > >     [foo
> > >     ]
> > >     [bar
> > >     ]
> > >     [baz]
> > >
> > >     $ perl -E 'say "[$_]" for split /^a/ => "foo\nabar\nabaz"'
> > >     [foo
> > >     abar
> > >     abaz]
> > >
> >
> > Yes, I have said exactly the same thing multiple times in this thread.
> >
> > And to me its actually exactly the reason we *should* do this. I consider
> > the inconsistency here to be *most* undesirable.
> >
> > As I said elsewhere in this thread, why should split /$/ not have the same
> > "special" rule applied? I find the extreme differences in the following  to
> > be *most* surprising.
>
> Because noone uses /$/m to split a multiline string into individual lines,
> as that leaves you with strings starting with a newline.  Giving /$/
> a special rule just means an extra testcase, and another thing to use
> in obfuscated code, but it won't be useful for most people (it will also
> be harmless).
>
> I'm still figuring out what problem needs solving. Is it really a problem
> that
>
>     split /\A/, "multiline string";
>
> splits as /^/m? Is splitting on the beginning of the string, resulting in a
> one-element list consisting of the string itself so useful we want to
> overhaul how split is working?
>
> Can't we just document this exception?

I don't like the exceptions here, and I find the inconsistency to be very confusing. Regexes are hard enough without them having weird inconsistencies like the ones we have.

My intent is to make split default to /m enabled which I believe is the right and complete way to have done this (mis)feature in the first place.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
From: Abigail <abigail [...] abigail.be>
To: demerphq <demerphq [...] gmail.com>
Date: Fri, 12 Sep 2014 10:17:03 +0200
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Perl5 Porteros <perl5-porters [...] perl.org>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
On Fri, Sep 12, 2014 at 05:19:49AM +0200, demerphq wrote:
> I dont like the exceptions here, and I find the inconsistency to be very
> confusing. Regexes are hard enough without them having weird
> inconsistencies like the ones we have.

But split is already full of exceptions:

  * Any pattern matching the empty string is special cased.
  * // is even doubly special cased.
  * " " is special cased (but / / isn't)

> My intent is to make split default to /m enabled which I believe is the
> right and complete way to have done this (mis)feature in the first place.

Really? For what purpose? You'd potentially break code, and it won't
fix the issue of /\A/ acting like /^/m, because even with /m, /\A/
isn't supposed to match any internal newlines anyway. That's the entire
point of /\A/.

Abigail
To: demerphq <demerphq [...] gmail.com>
Date: Fri, 12 Sep 2014 13:07:31 +0200
From: Abigail <abigail [...] abigail.be>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Perl5 Porteros <perl5-porters [...] perl.org>
On Fri, Sep 12, 2014 at 10:17:03AM +0200, Abigail wrote:
>
> Really? For what purpose? You'd potentially break code, and it won't
> fix the issue of /\A/ acting like /^/m, because even with /m, /\A/
> isn't supposed to match any internal newlines anyway. That's the entire
> point of /\A/.

Having said that, the only code affected by such a change is a split
on a multiline string, with a pattern which includes either /^/ or /$/,
other than a lone /^/. And since /^PAT/ is pretty useless without a /m,
I doubt there's a lot of code that's affected by such a change.

Abigail
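
To make concrete which kind of code a /m default would touch, here is a small sketch built on the /^a/ example quoted earlier in the thread. It only demonstrates the behaviour with and without an explicit /m; the changed default itself is a proposal under discussion and is not modelled here. Variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "foo\nabar\nabaz";

# Without /m, ^ anchors only at offset 0. The string starts with "f",
# so the pattern never matches and split returns the input unchanged.
my @without_m = split /^a/, $str;

# With /m, ^ also matches after each embedded newline.
my @with_m = split /^a/m, $str;

print scalar(@without_m), " field(s) without /m\n";
print scalar(@with_m),    " field(s) with /m\n";
print "[$_]\n" for @with_m;
```

Only patterns of this shape (an anchor plus something else, applied to a multiline string, with no /m) would change meaning under the proposal, which is why both participants expect little real-world breakage.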
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Perl5 Porteros <perl5-porters [...] perl.org>
Date: Fri, 12 Sep 2014 14:19:28 +0200
To: Abigail <abigail [...] abigail.be>
From: demerphq <demerphq [...] gmail.com>
On 12 September 2014 13:07, Abigail <abigail@abigail.be> wrote:
> On Fri, Sep 12, 2014 at 10:17:03AM +0200, Abigail wrote:
> >
> > Really? For what purpose? You'd potentially break code, and it won't
> > fix the issue of /\A/ acting like /^/m, because even with /m, /\A/
> > isn't supposed to match any internal newlines anyway. That's the entire
> > point of /\A/.
>
> Having said that, the only code effected by such a change is a split
> on a multiline string, with a pattern which includes either /^/ or /$/,
> other than a lone /^/. And since /^PAT/ is pretty useless without a /m,
> I doubt there's a lot of code that's effected by such a change.

Indeed. Exactly the same conclusion I came to as well.

Yves


--
perl -Mre=debug -e "/just|another|perl|hacker/"
Date: Fri, 12 Sep 2014 14:21:56 +0200
To: Abigail <abigail [...] abigail.be>
From: demerphq <demerphq [...] gmail.com>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Perl5 Porteros <perl5-porters [...] perl.org>
On 12 September 2014 10:17, Abigail <abigail@abigail.be> wrote:
> On Fri, Sep 12, 2014 at 05:19:49AM +0200, demerphq wrote:

[snip]

> > I dont like the exceptions here, and I find the inconsistency to be very
> > confusing. Regexes are hard enough without them having weird
> > inconsistencies like the ones we have.
>
> But split is already full of exceptions:
>
>   * Any pattern matching the empty string is special cased.

I don't know if I agree here. Part of this behavior is the default behaviour for how an empty pattern matches

perl -le'my $str="abcdef"; while($str=~//g) { print substr($str,$-[0],1) }'
a
b
c
d
e
f

 
>   * // is even doubly special cased.

Perhaps ENOTENOUGHCOFFEE, but can you expand on that, I don't recall what you mean. (Which in itself is a good reason to eliminate as many of the inconsistencies as possible.)
 
>   * " " is special cased (but / / isn't)

I consider the inability to simulate " " using a qr//  or // a bug, and it is on my todo list to fix. I don't consider one bug an excuse not to fix other bugs.
 
> > My intent is to make split default to /m enabled which I believe is the
> > right and complete way to have done this (mis)feature in the first place.
>
> Really? For what purpose?

Consistency in behaviour of things like split /^/ and split /^x/ at the very least.
 
> You'd potentially break code,

Let's find out whether that is FUD or fact. As far as I can tell, the only code that might be affected would be something like split /$/, which you are already on record as saying nobody uses.
 
> and it won't fix the issue of /\A/ acting like /^/m, because even with /m, /\A/
> isn't supposed to match any internal newlines anyway. That's the entire
> point of /\A/.

Yes it does.

/^/       => SBOL
/^/m    => MBOL
/\A/     => SBOL
/\A/m  => SBOL

The equivalence of /^/ and /^/m is afforded by the following code:

        else if (PL_regkind[fop] == BOL && nop == END)

if we change the default of split to /m and that code is changed to:

        else if (fop == MBOL && nop == END)

then 

split /^/, => MBOL
split /\A/, => SBOL

which fixes the bug in this thread, and makes split's behaviour consistent with regular patterns.

Yves
ps: Abigail sorry for the dupe, I accidentally replied to you direct instead of "to-all".
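
The distinction between the two anchors under /m can be checked outside split with an ordinary global match. The following is a minimal sketch, assuming nothing beyond core Perl; `match_offsets` is a hypothetical helper written for this illustration and is not part of the proposed patch.

```perl
use strict;
use warnings;

# Return every offset at which a zero-width pattern matches in $text.
sub match_offsets {
    my ($re, $text) = @_;
    my @offsets;
    push @offsets, $-[0] while $text =~ /$re/g;
    return @offsets;
}

my $str = "foo\nbar\nbaz";

my @caret   = match_offsets(qr/^/,   $str);  # single-line ^
my @caret_m = match_offsets(qr/^/m,  $str);  # multi-line ^
my @bos_m   = match_offsets(qr/\A/m, $str);  # \A is unaffected by /m

print "/^/   matches at: @caret\n";
print "/^/m  matches at: @caret_m\n";
print "/\\A/m matches at: @bos_m\n";
```

In a regular match only /^/m finds the positions after the embedded newlines, which is the behaviour the message above wants split to mirror.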
From: Abigail <abigail [...] abigail.be>
To: demerphq <demerphq [...] gmail.com>
Date: Fri, 12 Sep 2014 14:50:25 +0200
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Perl5 Porteros <perl5-porters [...] perl.org>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
On Fri, Sep 12, 2014 at 02:21:56PM +0200, demerphq wrote:
> >   * // is even doubly special cased.
>
> Perhaps ENOTENOUGHCOFFEE, but can you expand on that, I don't recall what
> you mean. (Which in itself is a good reason to eliminate as many of the
> inconsistencies as possible.)

From the split doc entry:

    As a special case for "split", the empty pattern given in match
    operator syntax ("//") specifically matches the empty string,
    which is contrary to its usual interpretation as the last
    successful match.

So it's special cased to get to not mean the last successful match,
but to be the empty string.

> >   * " " is special cased (but / / isn't)
>
> I consider the inability to simulate " " using a qr// or // a bug, and it
> is on my todo list to fix. I don't consider one bug an excuse not to fix
> other bugs.

How do you propose to "fix" that? Both C<< split " " >> and C<< split / / >>
are quite frequent, and useful.

Abigail
To: perl5-porters [...] perl.org
Date: Fri, 12 Sep 2014 14:57:01 +0200
From: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
* Abigail <abigail@abigail.be> [2014-09-11 18:15]:
> How useful is it to be able to write:
>
>     split /\A/ => $foo;
>
> when you could have written
>
>     $foo;
>
> instead?

There are quite a few APIs that use regexps as a sort of DSL. It’s not
hard to imagine a data munging function that does a split internally but
expects/allows you to specify the delimiter to split on. And on occasion
you may then have need to make the function not split the string at all,
in which case you require some kind of pattern that can turn split into
an identity function.

* Abigail <abigail@abigail.be> [2014-09-12 10:20]:
> On Fri, Sep 12, 2014 at 05:19:49AM +0200, demerphq wrote:
> > My intent is to make split default to /m enabled which I believe is
> > the right and complete way to have done this (mis)feature in the
> > first place.
>
> Really? For what purpose? You'd potentially break code, and it won't
> fix the issue of /\A/ acting like /^/m, because even with /m, /\A/
> isn't supposed to match any internal newlines anyway. That's the
> entire point of /\A/.

Well yes, the entire point of this thread is the idea that split /\A/
should not behave like split /^/.

* demerphq <demerphq@gmail.com> [2014-09-12 14:25]:
> On 12 September 2014 10:17, Abigail <abigail@abigail.be> wrote:
> >   * // is even doubly special cased.
>
> Perhaps ENOTENOUGHCOFFEE, but can you expand on that, I don't recall
> what you mean. (Which in itself is a good reason to eliminate as many
> of the inconsistencies as possible.)

    split //, "foobar" # yields qw( f o o b a r )

Normally an empty match reuses the last pattern but here it really means
an empty match. Maybe the sense in which Abigail is calling it doubly
special-cased is that it is normally special-cased in the RE engine, but
split, as a special-case in turn, removes that special-case treatment?

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
From: Abigail <abigail [...] abigail.be>
To: Aristotle Pagaltzis <pagaltzis [...] gmx.de>
Date: Fri, 12 Sep 2014 15:52:12 +0200
CC: perl5-porters [...] perl.org
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
On Fri, Sep 12, 2014 at 02:57:01PM +0200, Aristotle Pagaltzis wrote:
> * Abigail <abigail@abigail.be> [2014-09-11 18:15]:
> > How useful is it to be able to write:
> >
> >     split /\A/ => $foo;
> >
> > when you could have written
> >
> >     $foo;
> >
> > instead?
>
> There are quite a few APIs that use regexps as a sort of DSL. It’s not
> hard to imagine a data munging function that does a split internally but
> expects/allows you to specify the delimiter to split on. And on occasion
> you may then have need to make the function not split the string at all,
> in which case you require some kind of pattern that can turn split into
> an identity function.

Sure, a niche case, and one for which /\A/ isn't the only option.
(I would use /(*FAIL)/ or /(?!)/, as that makes the intent more clear).
> * Abigail <abigail@abigail.be> [2014-09-12 10:20]:
> > On Fri, Sep 12, 2014 at 05:19:49AM +0200, demerphq wrote:
> > > My intent is to make split default to /m enabled which I believe is
> > > the right and complete way to have done this (mis)feature in the
> > > first place.
> >
> > Really? For what purpose? You'd potentially break code, and it won't
> > fix the issue of /\A/ acting like /^/m, because even with /m, /\A/
> > isn't supposed to match any internal newlines anyway. That's the
> > entire point of /\A/.
>
> Well yes, the entire point of this thread is the idea that split /\A/
> should not behave like split /^/.
>
> * demerphq <demerphq@gmail.com> [2014-09-12 14:25]:
> > On 12 September 2014 10:17, Abigail <abigail@abigail.be> wrote:
> > >   * // is even doubly special cased.
> >
> > Perhaps ENOTENOUGHCOFFEE, but can you expand on that, I don't recall
> > what you mean. (Which in itself is a good reason to eliminate as many
> > of the inconsistencies as possible.)
>
>     split //, "foobar" # yields qw( f o o b a r )
>
> Normally an empty match reuses the last pattern but here it really means
> an empty match. Maybe the sense in which Abigail is calling it doubly
> special-cased is that it is normally special-cased in the RE engine, but
> split, as a special-case in turn, removes that special-case treatment?

Double special cased as in "not acting like the normal //, but acting as
an empty string", and "patterns matching the empty string are special cased",
although Yves gives convincing evidence the latter isn't all that special
cased.

Abigail
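
The "identity split" use case and the /(?!)/ alternative mentioned in this message can be sketched as follows. `fields_of` is a hypothetical stand-in for the kind of data-munging function described above that takes its delimiter as a parameter; it is not an existing API.

```perl
use strict;
use warnings;

# Hypothetical helper: split a record on a caller-supplied delimiter.
sub fields_of {
    my ($delim, $record) = @_;
    return split $delim, $record;
}

my $record = "alpha,beta\ngamma,delta";

# Ordinary use: split on commas.
my @split = fields_of(qr/,/, $record);

# (?!) can never match, so split has nothing to split on and returns
# the record as a single field, embedded newline included.
my @identity = fields_of(qr/(?!)/, $record);

print scalar(@split),    " field(s) when splitting on ','\n";
print scalar(@identity), " field(s) when splitting on (?!)\n";
```

Because a never-matching pattern does not depend on how split treats anchors, it sidesteps the /\A/ question entirely.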
From: demerphq <demerphq [...] gmail.com>
Date: Fri, 12 Sep 2014 16:16:13 +0200
To: Abigail <abigail [...] abigail.be>
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Perl5 Porteros <perl5-porters [...] perl.org>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
On 12 September 2014 14:50, Abigail <abigail@abigail.be> wrote:
> On Fri, Sep 12, 2014 at 02:21:56PM +0200, demerphq wrote:
> > On 12 September 2014 10:17, Abigail <abigail@abigail.be> wrote:
> >
> > > On Fri, Sep 12, 2014 at 05:19:49AM +0200, demerphq wrote:
> > >
> > > [snip]
> >
> > > > I dont like the exceptions here, and I find the inconsistency to be very
> > > > confusing. Regexes are hard enough without them having weird
> > > > inconsistencies like the ones we have.
> > >
> > > But split is already full of exceptions:
> > >
> > >   * Any pattern matching the empty string is special cased.
> > >
> >
> > I don't know if I agree here. Part of this behavior is the default
> > behaviour for how an empty pattern matches
> >
> > perl -le'my $str="abcdef"; while($str=~//g) { print substr($str,$-[0],1) }'
> > a
> > b
> > c
> > d
> > e
> > f
> >
> >
> >
> > >   * // is even doubly special cased.
> > >
> >
> > Perhaps ENOTENOUGHCOFFEE, but can you expand on that, I don't recall what
> > you mean. (Which in itself is a good reason to eliminate as many of the
> > inconsistencies as possible.)
>
> From the split doc entry:
>
>     As a special case for "split", the empty pattern given in match
>     operator syntax ("//") specifically matches the empty string,
>     which is contrary to its usual interpretation as the last
>     successful match.
>
> So it's special cased to get to not mean the last succesful match,
> but to be the empty string.


Oh that. Right. That isn't a special case in split; it's a special case in the m// and s/// operators that is not present in split, nor in qr//.

$ perl -le'"foo"=~/(.*)/ and print $1; print qr//'
foo
(?^:)

Also I thought we decided that that feature wasn't very useful and were going to deprecate it. :-)

 
> > >   * " " is special cased (but / / isn't)
> >
> > I consider the inability to simulate " " using a qr//  or // a bug, and it
> > is on my todo list to fix. I don't consider one bug an excuse not to fix
> > other bugs.
>
> How do you propose to "fix" that? Both C<< split " " >> and C<< split / / >>
> are quite frequent, and useful.

split qr/(*SPLIT_WHITE)/, $string

is my working plan. I fixed one issue related to this, I think in 5.18, where there was no way at all to parametrically get the split white behavior. Now you can do this:

./perl -Ilib -le'my $str=" "; my $foo="foo\n\n\nbar\n\n"; print ">$_<" for split $str, $foo'
>foo<
>bar<

But I think one should be able to do this with a qr// object as well.

Basically (*SPLIT_WHITE) would be semantically equivalent to \s+ in a normal regex, and produce the split white special case behavior in a split pattern.

Yves


--
perl -Mre=debug -e "/just|another|perl|hacker/"
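
For reference, the difference between the three whitespace splits discussed in this message can be seen side by side. This sketch uses only constructs that exist today; (*SPLIT_WHITE) is a proposal from this thread, not an available regex verb, so it does not appear in the code. The `show` helper is illustrative.

```perl
use strict;
use warnings;

my $str = "  foo  bar ";

# The literal string " " selects the awk-like mode: leading whitespace is
# skipped and runs of whitespace count as one separator.
my @awk = split " ", $str;

# A regex of one space splits on every single space, keeping the empty
# leading and interior fields.
my @space = split / /, $str;

# /\s+/ collapses runs, but the leading run still yields one empty field.
my @ws = split /\s+/, $str;

# Render a list of fields as [a][b]... so empty fields stay visible.
sub show { return join '', map { "[$_]" } @_ }

print 'split " "   : ', show(@awk),   "\n";   # [foo][bar]
print 'split / /   : ', show(@space), "\n";   # [][][foo][][bar]
print 'split /\s+/ : ', show(@ws),    "\n";   # [][foo][bar]
```

The leading empty field is what separates /\s+/ from the " " special case, and is the behaviour a qr// object currently cannot request.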
From: demerphq <demerphq [...] gmail.com>
To: Abigail <abigail [...] abigail.be>
Date: Fri, 12 Sep 2014 16:26:27 +0200
CC: Aristotle Pagaltzis <pagaltzis [...] gmx.de>, Perl5 Porteros <perl5-porters [...] perl.org>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
On 12 September 2014 15:52, Abigail <abigail@abigail.be> wrote:
> On Fri, Sep 12, 2014 at 02:57:01PM +0200, Aristotle Pagaltzis wrote:
> > * Abigail <abigail@abigail.be> [2014-09-11 18:15]:
> > > How useful is it to be able to write:
> > >
> > >     split /\A/ => $foo;
> > >
> > > when you could have written
> > >
> > >     $foo;
> > >
> > > instead?
> >
> > There are quite a few APIs that use regexps as a sort of DSL. It’s not
> > hard to imagine a data munging function that does a split internally but
> > expects/allows you to specify the delimiter to split on. And on occasion
> > you may then have need to make the function not split the string at all,
> > in which case you require some kind of pattern that can turn split into
> > an identity function.
>
> Sure, a niche case, and one for which /\A/ isn't the only option.
> (I would use /(*FAIL)/ or /(?!)/, as that makes the intent more clear).


FWIW, I am only moderately interested in fixing the split /\A/ behaviour in and of itself. IOW, if it turns out the /m default "just won't work", then I would be fine with saying "wont-fix". However, since I believe the /m default resolves a bunch of inconsistencies AND fixes the split /\A/ case, I am quite interested in getting that done.
 
> > * Abigail <abigail@abigail.be> [2014-09-12 10:20]:
> > > On Fri, Sep 12, 2014 at 05:19:49AM +0200, demerphq wrote:
> > > > My intent is to make split default to /m enabled which I believe is
> > > > the right and complete way to have done this (mis)feature in the
> > > > first place.
> > >
> > > Really? For what purpose? You'd potentially break code, and it won't
> > > fix the issue of /\A/ acting like /^/m, because even with /m, /\A/
> > > isn't supposed to match any internal newlines anyway. That's the
> > > entire point of /\A/.
> >
> > Well yes, the entire point of this thread is the idea that split /\A/
> > should not behave like split /^/.
> >
> >
> > * demerphq <demerphq@gmail.com> [2014-09-12 14:25]:
> > > On 12 September 2014 10:17, Abigail <abigail@abigail.be> wrote:
> > > >   * // is even doubly special cased.
> > >
> > > Perhaps ENOTENOUGHCOFFEE, but can you expand on that, I don't recall
> > > what you mean. (Which in itself is a good reason to eliminate as many
> > > of the inconsistencies as possible.)
> >
> >     split //, "foobar" # yields qw( f o o b a r )
> >
> > Normally an empty match reuses the last pattern but here it really means
> > an empty match. Maybe the sense in which Abigail is calling it doubly
> > special-cased is that it is normally special-cased in the RE engine, but
> > split, as a special-case in turn, removes that special-case treatment?
>
> Double special cased as in "not acting like the normal //, but acting as
> an empty string", and "patterns matching the empty string are special cased",
> although Yves gives convincing evidence the latter isn't all that special
> cased.

Sorry to repeat a previous mail, but IMO it is not that split // is special cased, but rather that m// and s//.../ are special cased. That special case also does not apply to qr//. Similar to the idea of (*SPLIT_WHITE), I have also contemplated a meta pattern (*LAST_SUCCESSFUL_MATCH_PATTERN), which would embed the last successful match pattern in another pattern. This would mean we could get rid of the special case of the empty pattern, which I consider dangerous, and we would actually have a more useful construct. Imagine something like this:

if ($str=~/$pat1/ or $str=~/$pat2/ or $str=~/$pat3/) {
    $str=~m/\((*LAST_SUCCESSFUL_MATCH_PATTERN)\)/;
}

I admit I am not sure how it would be used, but I am pretty sure someone (Damian?) would use it. :-)

Yves
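
The empty-pattern special case described in this message can be observed with existing syntax. This is a minimal sketch; (*LAST_SUCCESSFUL_MATCH_PATTERN) is only an idea floated in this thread and does not exist, so the code shows the current m// behaviour next to split //. Variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "abc";

# Prime the "last successful pattern" with a match that succeeds.
"xyz" =~ /y/ or die "unexpected: /y/ should match 'xyz'";

# In a match operator, an empty pattern silently reuses /y/,
# so "abc" does not match even though the pattern looks empty.
my $reused = ($str =~ //) ? 1 : 0;

# In split, // really is the empty pattern and splits between characters.
my @chars = split //, $str;

print "m// matched 'abc'? ", ($reused ? "yes" : "no"), "\n";
print "split // gave: @chars\n";
```

The same source text `//` therefore means two different things depending on the operator, which is the hazard the proposal aims to remove.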
From: Abigail <abigail [...] abigail.be>
Date: Fri, 12 Sep 2014 16:27:15 +0200
To: demerphq <demerphq [...] gmail.com>
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Perl5 Porteros <perl5-porters [...] perl.org>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
On Fri, Sep 12, 2014 at 04:16:13PM +0200, demerphq wrote:
> split qr/(*SPLIT_WHITE)/, $string
>
> is my working plan. I fixed one issue related to this, I think in 5.18,
> where there was no way at all to parametrically get the split white
> behavior. Now you can do this:
>
> ./perl -Ilib -le'my $str=" "; my $foo="foo\n\n\nbar\n\n"; print ">$_<" for split $str, $foo'
> >foo<
> >bar<
>
> But I think one should be able to do this with a qr// object as well.
>
> Basically (*SPLIT_WHITE) would be semantically equivalent to \s+ in a
> normal regex, and produce the split white special case behavior in a split
> pattern.

Are you saying you want to change the meaning of

    split " ", "string";

and people should write

    split qr /(*SPLIT_WHITE)/, "string";

instead?

That would not be very programmer friendly.

Abigail
Date: Fri, 12 Sep 2014 16:37:21 +0200
To: Abigail <abigail [...] abigail.be>
From: demerphq <demerphq [...] gmail.com>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Perl5 Porteros <perl5-porters [...] perl.org>
On 12 September 2014 16:27, Abigail <abigail@abigail.be> wrote:
> On Fri, Sep 12, 2014 at 04:16:13PM +0200, demerphq wrote:
> > On 12 September 2014 14:50, Abigail <abigail@abigail.be> wrote:
> >
> > > On Fri, Sep 12, 2014 at 02:21:56PM +0200, demerphq wrote:
> > > > On 12 September 2014 10:17, Abigail <abigail@abigail.be> wrote:

[snip]

> > > > >   * " " is special cased (but / / isn't)
> > > >
> > > > I consider the inability to simulate " " using a qr//  or // a bug, and it
> > > > is on my todo list to fix. I don't consider one bug an excuse not to fix
> > > > other bugs.
> > >
> > >
> > > How do you propose to "fix" that? Both C<< split " " >> and C<< split / / >>
> > > are quite frequent, and useful.
> > >
> >
> > split qr/(*SPLIT_WHITE)/, $string
> >
> > is my working plan. I fixed one issue related to this, I think in 5.18,
> > where there was no way at all to parametrically get the split white
> > behavior. Now you can do this:
> >
> > ./perl -Ilib -le'my $str=" "; my $foo="foo\n\n\nbar\n\n"; print ">$_<" for split $str, $foo'
> > >foo<
> > >bar<
> >
> > But I think one should be able to do this with a qr// object as well.
> >
> > Basically (*SPLIT_WHITE) would be semantically equivalent to \s+ in a
> > normal regex, and produce the split white special case behavior in a split
> > pattern.
> >
>
> Are you saying you want to change the meaning of
>
>     split " ", "string";
>
> and people should write
>
>     split qr /(*SPLIT_WHITE)/, "string";
>
> instead?
>
> That would not be very programmer friendly.

No no. I mean that I think code like this:

my $pat= qr/$user_pat/;

my @things= split /$pat/, $input;

should be capable of producing split white semantics.

I have no intention of removing the split " ", $string semantics, and on the contrary, the patch I mentioned for 5.18 means that I have made it even easier to do this kind of thing. Eg:

$ ./perl -Ilib -le'print $]; my $pat=" "; my $foo="foo\n\n\nbar\n\n"; print ">$_<" for split $pat, $foo'
5.021004
>foo<
>bar<

$ perl -le'print $]; my $pat=" "; my $foo="foo\n\n\nbar\n\n"; print ">$_<" for split $pat, $foo'
5.014002
>foo


bar

<

Previously the *only* way to get split white behavior was to write *exactly* C<split " ", $string>; there was no other way to do it.

So before that patch if you wanted to parametrically control the split, you would need something like:

my @things= $pat eq " " ? split " ", $input : split $pat, $input;

you couldn't even write this:

my @things= split $pat eq " " ? " " : $pat, $input

To summarize, I would like to make it so you can use a qr// object to obtain *every* special case behaviour of split. I have no intention of changing how split behaves when its argument is not a qr// object.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
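
The parametric case from this message can be wrapped in a small helper that behaves the same on old and new perls. This is a sketch under the assumption, taken from the message above, that run-time " " patterns gained whitespace semantics around 5.18; `split_param` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper following the older workaround quoted above: a
# literal " " is used when whitespace splitting is requested, so the
# result does not depend on how a run-time " " pattern is treated.
sub split_param {
    my ($pat, $input) = @_;
    return $pat eq " " ? split(" ", $input) : split($pat, $input);
}

my $input = "foo\n\n\nbar\n\n";

my @via_helper = split_param(" ", $input);

# Passing the pattern in a variable directly: whitespace semantics here
# depend on the perl version (per the message above, roughly 5.18 onward).
my $pat    = " ";
my @direct = split $pat, $input;

print "perl $]\n";
print "helper: ", join('', map { ">$_<" } @via_helper), "\n";
print "direct: ", join('', map { ">$_<" } @direct),     "\n";
```

On a perl with the change both lines print the same fields; on an older perl only the helper line does.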
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Perl5 Porteros <perl5-porters [...] perl.org>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
From: Abigail <abigail [...] abigail.be>
To: demerphq <demerphq [...] gmail.com>
Date: Fri, 12 Sep 2014 16:45:24 +0200
On Fri, Sep 12, 2014 at 04:37:21PM +0200, demerphq wrote:
> > No no. I mean that I think code like this: > > my $pat= qr/$user_pat/; > > my @things= split /$pat/, $input; > > should be capable of producing split white semantics. > > I have no intention of removing the split " ", $string semantics, and on > the contrary, the patch I mentioned for 5.18 means that I have made it even > easier to do this kind of thing. Eg: > > $ ./perl -Ilib -le'print $]; my $pat=" "; my $foo="foo\n\n\nbar\n\n"; print > ">$_<" for split $pat, $foo' > 5.021004
> >foo< > >bar<
> > $ perl -le'print $]; my $pat=" "; my $foo="foo\n\n\nbar\n\n"; print ">$_<" > for split $pat, $foo' > 5.014002
> >foo
> > > bar > > < > > Previously the *only* way to get split white behavior was to write > *exactly* C<split " ", $string>, there was no other way to do it. > > So before that patch if you wanted to parametrically control the split, you > would need something like: > > my @things= $pat eq " " ? split " ", $input : split $pat, $input; > > you couldnt write this even: > > my @things= split $pat eq " " ? " " : $pat, $input > > To summarize, I would like to make it so you can use a qr// object to > obtain *every* special case behaviour of split. I have no intention of > changing how split behaves when its argument is not a qr// object.
Excellent. I think adding a qr /(*SPLIT_WHITE)/, while keeping the existing behaviour of split, is a useful addition to the language. I presume $str =~ s/(*SPLIT_WHITE)/.../; and $str =~ /(*SPLIT_WHITE)/; will be meaningless, just as $pat = qr /(*SPLIT_WHITE)/; $str =~ /$pat/; Abigail
Date: Fri, 12 Sep 2014 16:56:27 +0200
To: Abigail <abigail [...] abigail.be>
From: demerphq <demerphq [...] gmail.com>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>, Perl5 Porteros <perl5-porters [...] perl.org>
On 12 September 2014 16:45, Abigail <abigail@abigail.be> wrote:
> On Fri, Sep 12, 2014 at 04:37:21PM +0200, demerphq wrote:
>
> > To summarize, I would like to make it so you can use a qr// object to
> > obtain *every* special case behaviour of split. I have no intention of
> > changing how split behaves when its argument is not a qr// object.
>
> Excellent.
>
> I think adding a qr /(*SPLIT_WHITE)/, while keeping the existing behaviour
> of split, is a useful addition to the language.
>
> I presume
>
>     $str =~ s/(*SPLIT_WHITE)/.../;
>
> and
>
>     $str =~ /(*SPLIT_WHITE)/;
>
> will be meaningless, just as
>
>     $pat = qr /(*SPLIT_WHITE)/;
>     $str =~ /$pat/;

Well, no, I think making them illegal in normal patterns would be nearly impossible. The construct would need to do something, and I was thinking it might behave the same as \n+ or something like that. I'm open to suggestions on what it does, however, and I could probably be convinced to make it warn in a normal pattern, implementation permitting.

Yves
RT-Send-CC: perl5-porters [...] perl.org
On Fri Sep 12 05:22:28 2014, demerphq wrote:
> I consider the inability to simulate " " using a qr// or // a bug, and it
> is on my todo list to fix. I don't consider one bug an excuse not to fix
> other bugs.

Omitting initial empty fields is more a feature of split than of the regexp engine. Making a special pattern that does that makes as much sense to me as qr//c.

If we were to consider the // to be part of the split operator (and I generally do), then we could introduce a m// modifier that only applies in split (and is an error otherwise).

--
Father Chrysostomos
RT-Send-CC: perl5-porters [...] perl.org
On Fri Sep 12 07:16:50 2014, demerphq wrote:
> Oh that. Right. That isn't a special case in split, it's a special case in
> the m// and s/// operator that isn't present in split nor is it in qr//.
>
> $ perl -le'"foo"=~/(.*)/ and print $1; print qr//'
> foo
> (?^:)
>
> Also I thought we decided that that feature wasn't very useful and were
> going to deprecate it. :-)

I use it.

> split qr/(*SPLIT_WHITE)/, $string
>
> is my working plan. I fixed one issue related to this, I think in 5.18,
> where there was no way at all to parametrically get the split white
> behavior. Now you can do this:
>
> ./perl -Ilib -le'my $str=" "; my $foo="foo\n\n\nbar\n\n"; print ">$_<" for split $str, $foo'
> >foo<
> >bar<
>
> But I think one should be able to do this with a qr// object as well.
>
> Basically (*SPLIT_WHITE) would be semantically equivalent to \s+ in a
> normal regex, and produce the split white special case behavior in a
> split pattern.

But if we are going to generalise it, it would be useful to skip initial null fields with other separators, such as /,/, too.

--
Father Chrysostomos
RT-Send-CC: perl5-porters [...] perl.org
On Fri Sep 12 07:37:43 2014, demerphq wrote:
> To summarize, I would like to make it so you can use a qr// object to
> obtain *every* special case behaviour of split. I have no intention of
> changing how split behaves when its argument is not a qr// object.

To my mind, that just doesn’t add up. How is that much different from having a way to specify the second half of s/// with qr//?

--
Father Chrysostomos
RT-Send-CC: perl5-porters [...] perl.org
On Fri Sep 12 07:57:02 2014, demerphq wrote:
> On 12 September 2014 16:45, Abigail <abigail@abigail.be> wrote:
>
> > On Fri, Sep 12, 2014 at 04:37:21PM +0200, demerphq wrote:
> >
> > > To summarize, I would like to make it so you can use a qr// object to
> > > obtain *every* special case behaviour of split. I have no intention of
> > > changing how split behaves when its argument is not a qr// object.
> >
> > Excellent.
> >
> > I think adding a qr /(*SPLIT_WHITE)/, while keeping the existing behaviour
> > of split, is a useful addition to the language.
> >
> > I presume
> >
> >     $str =~ s/(*SPLIT_WHITE)/.../;
> >
> > and
> >
> >     $str =~ /(*SPLIT_WHITE)/;
> >
> > will be meaningless, just as
> >
> >     $pat = qr /(*SPLIT_WHITE)/;
> >     $str =~ /$pat/;
>
> Well, no, I think making them illegal in normal patterns would be nearly
> impossible. The construct would need to do something, and I was thinking it
> might behave the same as \n+ or something like that. I'm open to suggestions
> on what it does however, and I could probably be convinced to make it warn
> in a normal pattern, implementation permitting.

Oh, and what would split /(,)(?(1)(*SPLIT_WHITE))/ do?

I just can’t wrap my mind around this \s+-and-a-split-flag construct.

Maybe what we want is qr//k, where the /k flag is ignored by m// and s///, but is taken by split to mean sKip initial null fields.

But then what would split /foo${that_qr}bar/ do?

--
Father Chrysostomos
From: demerphq <demerphq [...] gmail.com>
To: Perl RT Bug Tracker <perlbug-followup [...] perl.org>
Date: Fri, 12 Sep 2014 17:32:42 +0200
CC: Perl5 Porteros <perl5-porters [...] perl.org>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
On 12 September 2014 17:20, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:
> On Fri Sep 12 05:22:28 2014, demerphq wrote:
> > I consider the inability to simulate " " using a qr// or // a bug, and it
> > is on my todo list to fix. I don't consider one bug an excuse not to fix
> > other bugs.
>
> Omitting initial empty fields is more a feature of split than of the regexp engine. Making a special pattern that does that makes as much sense to me as qr//c.

qr//c is obviously useless. On the other hand two *very* experienced regex people, Abigail and myself, both see the utility of a (*SPLIT_WHITE) meta pattern that allows split to trigger the special case triggered by split //, $foo. I think that is sufficient justification to overlook your inability to see its utility.

> If we were to consider the // to be part of the split operator (and I generally do),

I consider that wrong. Split is a function which uses a pattern as an argument, and changes its behaviour based on what that pattern is.

> then we could introduce a m// modifier that only applies in split (and is an error otherwise).

I don't think a modifier is required, or even a particularly elegant solution to this.

Yves


--
perl -Mre=debug -e "/just|another|perl|hacker/"
From: demerphq <demerphq [...] gmail.com>
Date: Fri, 12 Sep 2014 17:35:19 +0200
To: Perl RT Bug Tracker <perlbug-followup [...] perl.org>
CC: Perl5 Porteros <perl5-porters [...] perl.org>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
On 12 September 2014 17:22, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:
> On Fri Sep 12 07:16:50 2014, demerphq wrote:
> > Oh that. Right. That isn't a special case in split, it's a special case in
> > the m// and s/// operator that isn't present in split nor is it in qr//.
> >
> > $ perl -le'"foo"=~/(.*)/ and print $1; print qr//'
> > foo
> > (?^:)
> >
> > Also I thought we decided that that feature wasn't very useful and were
> > going to deprecate it. :-)
>
> I use it.

Yes, I think I have used it once or twice in my career. However, the fact that a very small number of people use a feature that most consider confusing and dangerous is not generally a reason not to deprecate it. If it were widely used then it would be different.

> > split qr/(*SPLIT_WHITE)/, $string
> >
> > is my working plan. I fixed one issue related to this, I think in 5.18,
> > where there was no way at all to parametrically get the split white
> > behavior. Now you can do this:
> >
> > ./perl -Ilib -le'my $str=" "; my $foo="foo\n\n\nbar\n\n"; print ">$_<" for split $str, $foo'
> > >foo<
> > >bar<
> >
> > But I think one should be able to do this with a qr// object as well.
> >
> > Basically (*SPLIT_WHITE) would be semantically equivalent to \s+ in a
> > normal regex, and produce the split white special case behavior in a
> > split pattern.
>
> But if we are going to generalise it, it would be useful to skip initial null fields with other separators, such as /,/, too.

Then we can create a pattern that does it. (*EAT_EMPTY) maybe.
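
For comparison, the effect under discussion (skipping initial empty fields with an arbitrary separator) can be had without any new regex construct by post-processing the result. The following is only a sketch: `split_skip_leading` is a hypothetical helper name, and `(*EAT_EMPTY)` itself is merely a name floated in this thread, not something the code relies on.

```perl
use strict;
use warnings;

# Hypothetical helper: split $str on $sep, then drop the empty
# fields at the front of the result. Empty fields in the middle
# are left alone.
sub split_skip_leading {
    my ($sep, $str) = @_;
    my @fields = split $sep, $str;
    shift @fields while @fields && $fields[0] eq '';
    return @fields;
}

# Two leading empty fields are dropped; the empty field between
# "a" and "b" is kept.
print join("|", split_skip_leading(qr/,/, ",,a,,b")), "\n";   # prints "a||b"
```

The sketch assumes a separator without capture groups; with captures, split can return undefined fields and the `eq ''` test would need a `defined` guard.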

Yves
Date: Fri, 12 Sep 2014 17:35:56 +0200
To: Perl RT Bug Tracker <perlbug-followup [...] perl.org>
From: demerphq <demerphq [...] gmail.com>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: Perl5 Porteros <perl5-porters [...] perl.org>
On 12 September 2014 17:24, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:
> On Fri Sep 12 07:37:43 2014, demerphq wrote:
> > To summarize, I would like to make it so you can use a qr// object to
> > obtain *every* special case behaviour of split. I have no intention of
> > changing how split behaves when its argument is not a qr// object.
>
> To my mind, that just doesn’t add up. How is that much different from having a way to specify the second half of s/// with qr//?

Completely different. As different as jet-planes and penguins.

Yves
Date: Fri, 12 Sep 2014 17:41:37 +0200
To: Perl RT Bug Tracker <perlbug-followup [...] perl.org>
From: demerphq <demerphq [...] gmail.com>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
On 12 September 2014 17:28, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:
> On Fri Sep 12 07:57:02 2014, demerphq wrote:
> > On 12 September 2014 16:45, Abigail <abigail@abigail.be> wrote:
> >
> > > On Fri, Sep 12, 2014 at 04:37:21PM +0200, demerphq wrote:
> > >
> > > > To summarize, I would like to make it so you can use a qr// object to
> > > > obtain *every* special case behaviour of split. I have no intention of
> > > > changing how split behaves when its argument is not a qr// object.
> > >
> > > Excellent.
> > >
> > > I think adding a qr /(*SPLIT_WHITE)/, while keeping the existing behaviour
> > > of split, is a useful addition to the language.
> > >
> > > I presume
> > >
> > >     $str =~ s/(*SPLIT_WHITE)/.../;
> > >
> > > and
> > >
> > >     $str =~ /(*SPLIT_WHITE)/;
> > >
> > > will be meaningless, just as
> > >
> > >     $pat = qr /(*SPLIT_WHITE)/;
> > >     $str =~ /$pat/;
> >
> > Well, no, I think making them illegal in normal patterns would be nearly
> > impossible. The construct would need to do something, and I was thinking it
> > might behave the same as \n+ or something like that. I'm open to suggestions
> > on what it does however, and I could probably be convinced to make it warn
> > in a normal pattern, implementation permitting.
>
> Oh, and what would split /(,)(?(1)(*SPLIT_WHITE))/ do?

Not sure yet. Maybe nothing.

> I just can’t wrap my mind around this \s+-and-a-split-flag construct.

I couldn't possibly comment on your inability to wrap your mind around this.

> Maybe what we want is qr//k, where the /k flag is ignored by m// and s///, but is taken by split to mean sKip initial null fields.

/k is unavailable to us due to Regexp::Common.

Although I retract an earlier comment, *maybe* a modifier is appropriate for some of these issues. It deserves more reflection than I gave it in a previous mail.

> But then what would split /foo${that_qr}bar/ do?

Probably just revert to its "normal" regex behaviour.

cheers,
Yves
RT-Send-CC: perl5-porters [...] perl.org
I don't think this ticket is productive anymore. 20 posts in half a day between just 2, or maybe 3 people.

--
bulk88 ~ bulk88 at hotmail.com
RT-Send-CC: perl5-porters [...] perl.org
On Fri Sep 12 08:33:15 2014, demerphq wrote:
> qr//c is obvious useless. On the other hand two *very* experienced regex
> people, Abigail and myself, both see the utility of a (*SPLIT_WHITE) meta
> pattern that allows split to trigger the special case triggered by split
> //, $foo. I think that is sufficient justification to overlook your
> inability to see its utility.

It’s not that I do not see its utility. It just seems like too much of a special case, and I thought we were trying to get away from those. If it’s something that goes in a pattern, but affects the behaviour of one specific operator that acts on the pattern, then what is its scope? etc., etc.

Now, if we want to add a thingy that goes in a pattern and flags the pattern to tell split to skip initial empty fields, then let’s make it general. E.g., your /(*SPLIT_WHITE)/ could be written /(*EAT_EMPTY)\s+/ or /(?q)\s+/ or /\s+/q (with q only because q is available).

--
Father Chrysostomos
Date: Fri, 12 Sep 2014 23:37:47 +0100
To: perl5-porters [...] perl.org
From: hv [...] crypt.org
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
I've completely lost track of the bifurcating paths of the discussion, so I hope we'll get a synopsis of a proposal at some point, ideally as a separate thread.

Somewhere in there were references to making split patterns act as if they had //m on by default. That sounds like choosing to exchange one crazy set of behaviours in all previous versions of perl with a differently crazy set of behaviours in all subsequent versions. I may have got the proposal wrong though.

Not sure that I've used /^/ or /\A/ much in split patterns, but I've almost certainly used /$/ or /\z/, probably with things like:

    /($delimiter)(?=$field(?=$delimiter|$))/

.. where the patterns for $delimiter and $field were sufficiently ambiguous. I was never aware of an implied //m, so I've never knowingly used that.

I'd much rather deprecate the special case, and let people say what they mean, than invite further breakage by extending the special case further.

Hugo
To: perl5-porters [...] perl.org
Date: Fri, 12 Sep 2014 21:08:52 -0400
From: Ricardo Signes <perl.p5p [...] rjbs.manxome.org>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
* demerphq <demerphq@gmail.com> [2014-09-11T14:20:28]
> In fact the process of writing this email I have become sufficiently
> convinced that option /m is the right thing to do and that I will start
> writing the patch now so we can find out if it breaks anything.

I am sitting here making my "I am so nervous face," but I also can't really come up with much that I think will be affected. As a side note, I did find this amusing line:

    https://metacpan.org/source/ANDYA/TAP-Parser-0.54/t/040-parse.t#L630

Anyway, on one hand and in one way this is a big scary change that makes me antsy, but on the other hand, I think it will trade one goofy special case for another straightforward one.

This is not to say that I'm saying "do it!" But it sounds like you want to write the patch, and if you do that, we can smoke CPAN and also look at specific changes. So I think that's a decent step forward.

A lot of other stuff came up in this thread about /other/ changes to split and patterns. I think they should be discussed on their own, rather than as part of discussing whether/how/why to fix splitting on /\A/.

--
rjbs
CC: "bugs-bitbucket [...] rt.perl.org" <bugs-bitbucket [...] rt.perl.org>
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
From: demerphq <demerphq [...] gmail.com>
Date: Wed, 17 Sep 2014 05:12:52 +0200
To: Perl5 Porteros <perl5-porters [...] perl.org>
On 11 September 2014 16:29, demerphq <demerphq@gmail.com> wrote:
> On 11 September 2014 14:28, l.mai@web.de <perlbug-followup@perl.org> wrote:
> > # New Ticket Created by  l.mai@web.de
> > # Please include the string:  [perl #122761]
> > # in the subject line of all future correspondence about this issue.
> > # <URL: https://rt.perl.org/Ticket/Display.html?id=122761 >
> >
> > perldoc perlrebackslash:
> >
> >    \A  "\A" only matches at the beginning of the string.
> >
> > perldoc -f split:
> >
> >    Empty leading fields are produced when there are positive-width matches at the beginning of the string; a zero-width match at the beginning of the string does not produce an empty field.
> >
> > Therefore split /\A/ should return the input string as is. \A can only match once (at offset 0), which (logically speaking) should turn "foo" into ("", "foo"), but because of the special case in split of not producing empty leading fields for zero-width matches at the beginning, we just get "foo" again.

[split]

> I can do some kind of workaround that makes /\A/ not trigger this optimisation,

I have fixed this with:

1645b83c5ceecd8a95db0310d80125d8b188eb83 Perl RT #122761 - split /\A/ should not behave like split /^/m
aa48e906ca55e0da8e1317549a4ddafff3837f3f change NODE_ALIGN_FILL to set flags to 0
d3d47aac53402ea3d4836c60e3659dc927a9887c Eliminate the duplicative regops BOL and EOL

Note that this does NOT make split // default to /m enabled. It simply allows the split optimisation involved to distinguish between /^/ and /\A/. 
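
The behaviour being distinguished here can be checked with a short script. This is a sketch under stated assumptions rather than part of the patch: it assumes a perl that includes the commits listed above, and the variable names are illustrative. On older perls, such as the 5.12.4 from the original report, the second split returns three fields instead of one.

```perl
use strict;
use warnings;

my $text = "foo\nbar\nbaz";

# /^/ in split is documented to behave like /^/m: it matches after
# every embedded newline, so the string is broken into lines.
my @by_caret = split /^/, $text;

# /\A/ can only match at offset 0. A zero-width match at the start
# produces no empty leading field, so with the fix applied the
# string comes back as a single element.
my @by_bos = split /\A/, $text;

printf "split /^/  gave %d field(s)\n", scalar @by_caret;
printf "split /\\A/ gave %d field(s)\n", scalar @by_bos;
```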

Related to this I did some cleanup, freeing up bits, reducing object size, and other simplifications.

/me puts away the chainsaw.

I still plan to try the "default to /m in split" and see what happens, so please do not close this ticket right away, even though 1645b83c5ceecd8a95db0310d80125d8b188eb83 does fix the actual issue reported in this ticket.

Yves 

--
perl -Mre=debug -e "/just|another|perl|hacker/"
RT-Send-CC: perl5-porters [...] perl.org
On Tue Sep 16 20:13:15 2014, demerphq wrote:
> I have fixed this with:
>
> 1645b83c5ceecd8a95db0310d80125d8b188eb83 Perl RT #122761 - split /\A/ should not behave like split /^/m
> aa48e906ca55e0da8e1317549a4ddafff3837f3f change NODE_ALIGN_FILL to set flags to 0
> d3d47aac53402ea3d4836c60e3659dc927a9887c Eliminate the duplicative regops BOL and EOL

Did the porting tests fail before you ran make regen to regenerate the table in perldebguts.pod?

**Duck**

--
Father Chrysostomos
Subject: Re: [perl #122761] split /\A/ works like /^/m, matches embedded newlines
CC: Perl5 Porteros <perl5-porters [...] perl.org>
Date: Wed, 17 Sep 2014 07:18:20 +0200
To: Perl RT Bug Tracker <perlbug-followup [...] perl.org>
From: demerphq <demerphq [...] gmail.com>
On 17 September 2014 06:36, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:
> On Tue Sep 16 20:13:15 2014, demerphq wrote:
> > I have fixed this with:
> >
> > 1645b83c5ceecd8a95db0310d80125d8b188eb83 Perl RT #122761 - split /\A/ should not behave like split /^/m
> > aa48e906ca55e0da8e1317549a4ddafff3837f3f change NODE_ALIGN_FILL to set flags to 0
> > d3d47aac53402ea3d4836c60e3659dc927a9887c Eliminate the duplicative regops BOL and EOL
>
> Did the porting tests fail before you ran make regen to regenerate the table in perldebguts.pod?
>
> **Duck**

No. d3d47aac53 includes changes to regen/regcomp.pl and regcomp.sym which necessitated a regen anyway.

I did however leak some warning/diagnostics into the porting tests, which should only be shown when it is run manually, which I fixed in 53e19030564ba.

Smarty pants. :-)

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
RT-Send-CC: perl5-porters [...] perl.org
On Tue Sep 16 20:13:15 2014, demerphq wrote:
> I have fixed this with:
>
> 1645b83c5ceecd8a95db0310d80125d8b188eb83 Perl RT #122761 - split /\A/ should not behave like split /^/m
> aa48e906ca55e0da8e1317549a4ddafff3837f3f change NODE_ALIGN_FILL to set flags to 0
> d3d47aac53402ea3d4836c60e3659dc927a9887c Eliminate the duplicative regops BOL and EOL
>
> Note that this does NOT make split // default to /m enabled. It simply
> allows the split optimisation involved to distinguish between /^/ and /\A/.
>
> Related to this I did some cleanup, freeing up bits, reducing object size,
> and other simplifications.
>
> /me puts away the chainsaw.
>
> I still plan to try the "default to /m in split" and see what happens, so
> please do not close this ticket right away, even though
> 1645b83c5ceecd8a95db0310d80125d8b188eb83 does fix the actual issue reported
> in this ticket.

Shouldn't this be done in a new ticket then? (Also, is this still happening?)
RT-Send-CC: perl5-porters [...] perl.org
On Fri Feb 26 10:53:38 2016, mauke- wrote:
> On Tue Sep 16 20:13:15 2014, demerphq wrote:
> > I have fixed this with:
> >
> > 1645b83c5ceecd8a95db0310d80125d8b188eb83 Perl RT #122761 - split /\A/ should not behave like split /^/m
> > aa48e906ca55e0da8e1317549a4ddafff3837f3f change NODE_ALIGN_FILL to set flags to 0
> > d3d47aac53402ea3d4836c60e3659dc927a9887c Eliminate the duplicative regops BOL and EOL
> >
> > Note that this does NOT make split // default to /m enabled. It simply
> > allows the split optimisation involved to distinguish between /^/ and /\A/.
> >
> > Related to this I did some cleanup, freeing up bits, reducing object size,
> > and other simplifications.
> >
> > /me puts away the chainsaw.
> >
> > I still plan to try the "default to /m in split" and see what happens, so
> > please do not close this ticket right away, even though
> > 1645b83c5ceecd8a95db0310d80125d8b188eb83 does fix the actual issue reported
> > in this ticket.
>
> Shouldn't this be done in a new ticket then? (Also, is this still happening?)

I recommend closing this ticket and having anyone pursuing this open a new ticket.

--
James E Keenan (jkeenan@cpan.org)

