Text::ParseWords does not handle backslashed newline inside quoted text #7385

Closed
p5pRT opened this issue Jun 24, 2004 · 8 comments

@p5pRT

p5pRT commented Jun 24, 2004

Migrated from rt.perl.org#30442 (status was 'resolved')

Searchable as RT30442$

@p5pRT
Author

p5pRT commented Jun 24, 2004

From farghorn@yahoo.com

I am trying to parse a "line" (really a string) of text using
Text::ParseWords. I have quoted fields delimited by a "tab" (\t)
character. Within the quoted fields, an actual linefeed (0x0a)
character can appear, escaped with a backslash, e.g.

"field1" "field2\
still field2" "field3"
__END__

I looked at the source, and the main regex in parse_line() appears to
be intended to handle backslashed characters inside the quoted text.
I think this should include a newline. The problem is the use of "."
without the "/s" modifier.

At first, I tried changing the regex to use the "/s" modifier (in
addition to the "/x" modifier, of course). It worked well on my data.
It then occurred to me that the more correct solution would be to
change "." to "[\000-\377]".

I then noticed that a plain linefeed, not backslashed, works fine,
which makes me much more confident that this is a bug and not somehow
the intended behaviour.
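
To illustrate the point about "." and "/s", here is a minimal sketch. It uses a simplified stand-in pattern, not the actual regex from parse_line(), and the variable names are made up for the example. It only shows that `\\.` cannot match a backslash followed by a linefeed unless "/s" is in effect.

```perl
use strict;
use warnings;

# A quoted field containing a backslash followed by a real linefeed.
my $field = qq{"field2\\\nstill field2"};

# Simplified stand-in for the quoted-text alternative (an assumption for
# illustration): a backslash plus one character, or any character that is
# neither a backslash nor a double quote.
my $without_s = $field =~ /^"((?:\\.|[^\\"])*)"$/  ? 1 : 0;
my $with_s    = $field =~ /^"((?:\\.|[^\\"])*)"$/s ? 1 : 0;

print "matches without /s: $without_s\n";
print "matches with /s:    $with_s\n";
```

Without "/s" the match fails at the backslash, because "." refuses the linefeed and the negated class refuses the backslash. A plain linefeed is accepted either way, since the negated class matches it.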

Here is a patch:

*** ../app/perl/lib/5.8.2/Text/ParseWords.pm Thu Jan 1 15:24:48 2004
--- ./lib/Text/ParseWords.pm Wed Jun 23 14:14:43 2004
***************
*** 59,69 ****

  ($quote, $quoted, undef, $unquoted, $delim, undef) =
  $line =~ m/^(["']) # a $quote
! ((?:\\.|(?!\1)[^\\])*) # and $quoted text
  \1 # followed by the same quote
  ([\000-\377]*) # and the rest
  | # --OR--
! ^((?:\\.|[^\\"'])*?) # an $unquoted text
  (\Z(?!\n)|(?-x:$delimiter)|(?!^)(?=["']))
  # plus EOL, delimiter, or quote
  ([\000-\377]*) # the rest
--- 59,69 ----

  ($quote, $quoted, undef, $unquoted, $delim, undef) =
  $line =~ m/^(["']) # a $quote
! ((?:\\[\000-\377]|(?!\1)[^\\])*) # and $quoted text
  \1 # followed by the same quote
  ([\000-\377]*) # and the rest
  | # --OR--
! ^((?:\\[\000-\377]|[^\\"'])*?) # an $unquoted text
  (\Z(?!\n)|(?-x:$delimiter)|(?!^)(?=["']))
  # plus EOL, delimiter, or quote
  ([\000-\377]*) # the rest
***************
*** 76,84 ****
  $quoted = "$quote$quoted$quote";
  }
  else {
! $unquoted =~ s/\\(.)/$1/g;
  if (defined $quote) {
! $quoted =~ s/\\(.)/$1/g if ($quote eq '"');
  $quoted =~ s/\\([\\'])/$1/g if ( $PERL_SINGLE_QUOTE && $quote eq "'");
  }
  }
--- 76,84 ----
  $quoted = "$quote$quoted$quote";
  }
  else {
! $unquoted =~ s/\\([\000-\377])/$1/g;
  if (defined $quote) {
! $quoted =~ s/\\([\000-\377])/$1/g if ($quote eq '"');
  $quoted =~ s/\\([\\'])/$1/g if ( $PERL_SINGLE_QUOTE && $quote eq "'");
  }
  }

__END__

Perl Info

Flags:
    category=library
    severity=high

Site configuration information for perl v5.8.2:

Configured by sharvitd_s at Thu Jan  1 14:51:25 IST 2004.

Summary of my perl5 (revision 5.0 version 8 subversion 2) configuration:
  Platform:
    osname=linux, osvers=2.4.9-e.3, archname=i686-linux-64int-ld
    uname='linux rachel 2.4.9-e.3 #1 fri may 3 17:02:43 edt 2002 i686 unknown '
    config_args=''
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=define use64bitall=undef uselongdouble=define
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='2.96 20000731 (Red Hat Linux 7.2 2.96-108.1)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long long', ivsize=8, nvtype='long double', nvsize=12, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.2.4.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.2.4'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:



@INC for perl v5.8.2:
    /exlibris/sfx_ver/sfx_version_3/app/perl/lib/5.8.2/i686-linux-64int-ld
    /exlibris/sfx_ver/sfx_version_3/app/perl/lib/5.8.2
    /exlibris/sfx_ver/sfx_version_3/app/perl/lib/site_perl/5.8.2/i686-linux-64int-ld
    /exlibris/sfx_ver/sfx_version_3/app/perl/lib/site_perl/5.8.2
    /exlibris/sfx_ver/sfx_version_3/app/perl/lib/site_perl
    .


Environment for perl v5.8.2:
    HOME=/exlibris/sfx_ver/sfx_version_3/ed_3/home
    LANG=en_US
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/exlibris/sfx_ver/sfx_version_3/app/mysql/bin:/exlibris/sfx_ver/sfx_version_3/app/perl/bin:/exlibris/sfx_ver/sfx_version_3/app/utils:/opt/IBMJava2-131/bin:/opt/IBMJava2-131/jre/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:.:/exlibris/sfx_ver/sfx_version_3/ed_3/home/bin
    PERL_BADLANG (unset)
    PERL_HOME=/exlibris/sfx_ver/sfx_version_3/app/perl/bin
    SHELL=/bin/tcsh


@p5pRT
Author

p5pRT commented Jun 24, 2004

farghorn@yahoo.com - Status changed from 'new' to 'open'

@p5pRT
Author

p5pRT commented Jun 24, 2004

farghorn@yahoo.com - Status changed from 'open' to 'new'

@p5pRT
Author

p5pRT commented Jun 24, 2004

From @mhx

On 2004-06-24, at 07:40:12 -0000, Ephraim Dan (via RT) wrote:

> # New Ticket Created by Ephraim Dan
> # Please include the string: [perl #30442]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org:80/rt3/Ticket/Display.html?id=30442 >
>
> This is a bug report for perl from ephraim.dan@exlibris.co.il,
> generated with the help of perlbug 1.34 running under perl v5.8.2.
>
> -----------------------------------------------------------------
> [Please enter your report here]
>
> I am trying to parse a "line" (really a string) of text using
> Text::ParseWords. I have quoted fields delimited by a "tab" (\t)
> character. Within the quoted fields, an actual linefeed (0x0a)
> character can appear, escaped with a backslash, e.g.
>
> "field1" "field2\
> still field2" "field3"
> __END__

Thanks for your report.

First of all, please try to use a different mailer next time you
submit a patch, or attach the patch instead of inlining it.

My head is still aching from trying to read this one.

The bug has been fixed by the patch below using your suggestion
and adding some tests.

Marcus

Change 22992 by mhx@mhx-r2d2 on 2004/06/24 19:51:06

  Fix for: [perl #30442] Text::ParseWords does not handle backslashed newline inside quoted text
  Use the suggested regex fix, plus some tests.

Affected files ...

... //depot/perl/lib/Text/ParseWords.pm#16 edit
... //depot/perl/lib/Text/ParseWords.t#2 edit

Differences ...

==== //depot/perl/lib/Text/ParseWords.pm#16 (text) ====

@@ -1,7 +1,7 @@
package Text::ParseWords;

use vars qw($VERSION @ISA @EXPORT $PERL_SINGLE_QUOTE);
-$VERSION = "3.21";
+$VERSION = "3.22";

require 5.000;

@@ -59,11 +59,11 @@

  ($quote, $quoted, undef, $unquoted, $delim, undef) =
  $line =~ m/^(["']) # a $quote
- ((?:\\.|(?!\1)[^\\])*) # and $quoted text
+ ((?:\\[\000-\377]|(?!\1)[^\\])*) # and $quoted text
  \1 # followed by the same quote
  ([\000-\377]*) # and the rest
  | # --OR--
- ^((?:\\.|[^\\"'])*?) # an $unquoted text
+ ^((?:\\[\000-\377]|[^\\"'])*?) # an $unquoted text
  (\Z(?!\n)|(?-x:$delimiter)|(?!^)(?=["']))
  # plus EOL, delimiter, or quote
  ([\000-\377]*) # the rest
@@ -76,9 +76,9 @@
  $quoted = "$quote$quoted$quote";
  }
  else {
- $unquoted =~ s/\\(.)/$1/g;
+ $unquoted =~ s/\\([\000-\377])/$1/g;
  if (defined $quote) {
- $quoted =~ s/\\(.)/$1/g if ($quote eq '"');
+ $quoted =~ s/\\([\000-\377])/$1/g if ($quote eq '"');
  $quoted =~ s/\\([\\'])/$1/g if ( $PERL_SINGLE_QUOTE && $quote eq "'");
  }
  }

==== //depot/perl/lib/Text/ParseWords.t#2 (xtext) ====

@@ -8,7 +8,7 @@
use warnings;
use Text::ParseWords;

-print "1..18\n";
+print "1..20\n";

@words = shellwords(qq(foo "bar quiz" zoo));
print "not " if $words[0] ne 'foo';
@@ -108,3 +108,14 @@
@words = quotewords(' ', 1, '4 3 2 1 0');
print "not " unless join(";", @words) eq qq(4;3;2;1;0);
print "ok 18\n";
+
+# [perl #30442] Text::ParseWords does not handle backslashed newline inside quoted text
+$string = qq{"field1"	"field2\\\nstill field2"	"field3"};
+
+$result = join('|', parse_line("\t", 1, $string));
+print "not " unless $result eq qq{"field1"|"field2\\\nstill field2"|"field3"};
+print "ok 19\n";
+
+$result = join('|', parse_line("\t", 0, $string));
+print "not " unless $result eq "field1|field2\nstill field2|field3";
+print "ok 20\n";

@p5pRT
Author

p5pRT commented Jun 24, 2004

The RT System itself - Status changed from 'new' to 'open'

@p5pRT p5pRT closed this as completed Jun 24, 2004
@p5pRT
Author

p5pRT commented Jun 24, 2004

@mhx - Status changed from 'open' to 'resolved'

@p5pRT
Author

p5pRT commented Jun 25, 2004

From @hvds

Marcus Holland-Moritz <mhx-perl@gmx.net> wrote:
:The bug has been fixed by the patch below using your suggestion
:and adding some tests.
[...]
:- ((?:\\.|(?!\1)[^\\])*) # and $quoted text
:+ ((?:\\[\000-\377]|(?!\1)[^\\])*) # and $quoted text
[etc]

It might make these patterns clearer to add a /s flag, replace C< . >
with C< [^\n] >, and replace C< [\000-\377] > with C< . >. That would
also potentially play nicer with Unicode text: consider "\\\x{1234}".

Hugo
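
The Unicode concern can be checked in isolation with a short sketch. The patterns below are reduced to the escape alternative only and are not the full parse_line() regex; the variable names are invented for the example.

```perl
use strict;
use warnings;

# A backslash followed by a character above U+00FF.
my $escaped = "\\\x{1234}";

# [\000-\377] covers code points 0 through 255 only, so it cannot match
# U+1234. With /s, "." matches any single character.
my $byte_class = $escaped =~ /^\\[\000-\377]$/ ? 1 : 0;
my $dot_s      = $escaped =~ /^\\.$/s          ? 1 : 0;

print "[\\000-\\377] matches: $byte_class\n";
print ". with /s matches:   $dot_s\n";
```

The character class version fails on this input while the "." with "/s" version matches, which appears to be the case the suggestion has in mind.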

@p5pRT
Author

p5pRT commented Jun 25, 2004

From @mhx

On 2004-06-25, at 15:05:49 +0100, hv@crypt.org wrote:

> Marcus Holland-Moritz <mhx-perl@gmx.net> wrote:
> :The bug has been fixed by the patch below using your suggestion
> :and adding some tests.
> [...]
> :- ((?:\\.|(?!\1)[^\\])*) # and $quoted text
> :+ ((?:\\[\000-\377]|(?!\1)[^\\])*) # and $quoted text
> [etc]
>
> It might make these patterns clearer to add a /s flag, replace C< . >
> with C< [^\n] >, and replace C< [\000-\377] > with C< . >. That would
> also potentially play nicer with Unicode text: consider "\\\x{1234}".

Absolutely.

The change below does just that, and in addition makes parse_line()
slightly faster for short strings and a lot faster for long strings
by using s/// instead of m// and $+.

Marcus
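
As a rough illustration of the s/// approach described above, here is a minimal sketch of a tokenizer that strips the leading token from the string on each pass, rather than capturing "the rest" with m// and assigning it back. The helper split_by_consuming() is hypothetical and far simpler than parse_line(): it only splits on commas and knows nothing about quotes or escapes.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::ParseWords: split a string on
# commas by repeatedly removing the leading token with s///.
sub split_by_consuming {
    my ($line) = @_;
    my @tokens;
    while (length $line) {
        # Remove one token plus its trailing comma (or the end of string).
        # No capture group for the remainder is needed.
        $line =~ s/^([^,]*)(?:,|\z)//s or last;
        push @tokens, $1;
    }
    return @tokens;
}

print join('|', split_by_consuming('a,b,c')), "\n";   # prints "a|b|c"
```

Each pass consumes at least one character, so the loop terminates. The speedup mentioned above is presumably because the remainder no longer has to be captured and copied on every iteration; the sketch does not measure that.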

Change 22997 by mhx@mhx-r2d2 on 2004/06/25 20:27:05

  Cleanup the main regex in Text::ParseWords and make the
  parse_line() routine faster. Add a Unicode test case.

Affected files ...

... //depot/perl/lib/Text/ParseWords.pm#17 edit
... //depot/perl/lib/Text/ParseWords.t#3 edit

Differences ...

==== //depot/perl/lib/Text/ParseWords.pm#17 (text) ====

@@ -1,7 +1,7 @@
package Text::ParseWords;

use vars qw($VERSION @ISA @EXPORT $PERL_SINGLE_QUOTE);
-$VERSION = "3.22";
+$VERSION = "3.23";

require 5.000;

@@ -53,32 +53,27 @@
  use re 'taint'; # if it's tainted, leave it as such

  my($delimiter, $keep, $line) = @_;
- my($quote, $quoted, $unquoted, $delim, $word, @pieces);
+ my($word, @pieces);

  while (length($line)) {
+ $line =~ s/^(["']) # a $quote
+ ((?:\\.|(?!\1)[^\\])*) # and $quoted text
+ \1 # followed by the same quote
+ | # --OR--
+ ^((?:\\.|[^\\"'])*?) # an $unquoted text
+ (\Z(?!\n)|(?-x:$delimiter)|(?!^)(?=["']))
+ # plus EOL, delimiter, or quote
+ //xs; # extended layout
+ my($quote, $quoted, $unquoted, $delim) = ($1, $2, $3, $4);
+ return() unless( defined($quote) || length($unquoted) || length($delim));

- ($quote, $quoted, undef, $unquoted, $delim, undef) =
- $line =~ m/^(["']) # a $quote
- ((?:\\[\000-\377]|(?!\1)[^\\])*) # and $quoted text
- \1 # followed by the same quote
- ([\000-\377]*) # and the rest
- | # --OR--
- ^((?:\\[\000-\377]|[^\\"'])*?) # an $unquoted text
- (\Z(?!\n)|(?-x:$delimiter)|(?!^)(?=["']))
- # plus EOL, delimiter, or quote
- ([\000-\377]*) # the rest
- /x; # extended layout
- return() unless( $quote || length($unquoted) || length($delim));
-
- $line = $+;
-
  if ($keep) {
  $quoted = "$quote$quoted$quote";
  }
  else {
- $unquoted =~ s/\\([\000-\377])/$1/g;
+ $unquoted =~ s/\\(.)/$1/sg;
  if (defined $quote) {
- $quoted =~ s/\\([\000-\377])/$1/g if ($quote eq '"');
+ $quoted =~ s/\\(.)/$1/sg if ($quote eq '"');
  $quoted =~ s/\\([\\'])/$1/g if ( $PERL_SINGLE_QUOTE && $quote eq "'");
  }
  }

==== //depot/perl/lib/Text/ParseWords.t#3 (xtext) ====

@@ -8,7 +8,7 @@
use warnings;
use Text::ParseWords;

-print "1..20\n";
+print "1..21\n";

@words = shellwords(qq(foo "bar quiz" zoo));
print "not " if $words[0] ne 'foo';
@@ -119,3 +119,9 @@
$result = join('|', parse_line("\t", 0, $string));
print "not " unless $result eq "field1|field2\nstill field2|field3";
print "ok 20\n";
+
+# unicode
+$string = qq{"field1"\x{1234}"field2\\\x{1234}still field2"\x{1234}"field3"};
+$result = join('|', parse_line("\x{1234}", 0, $string));
+print "not " unless $result eq "field1|field2\x{1234}still field2|field3";
+print "ok 21\n";
