Pod::HTML should use a proper Unicode-aware definition of "word character" #12026

Closed
p5pRT opened this issue Mar 30, 2012 · 12 comments
Labels
distro-All ext/Pod-Html issues in the blead-upstream Pod-Html distribution installhtml Problems with 'installhtml' program or 'make' target type-library type-Unicode type-utilities

Comments


p5pRT commented Mar 30, 2012

Migrated from rt.perl.org#112140 (status was 'open')

Searchable as RT112140$


p5pRT commented Mar 26, 2011

From horton-p@aist.go.jp

Background: Tom Christiansen asked me to report the following as
a bug, in response to an email from me.

Symptom: When running pod2html on files with non-ASCII characters in link anchor names,
those characters are replaced with underscores.

Workaround: In Pod/Html.pm, comment out the line containing "$anchor =~ s/\W/_/g;"
This is line 2066 in Pod::Html Version 1.08.
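
The symptom can be reproduced in isolation with a minimal sketch. It assumes the anchor text reaches the substitution as undecoded UTF-8 octets, and it leaves out the `use locale` that the real module has. The helper name `underscore_nonword` is hypothetical and not part of Pod::Html.

```perl
use strict;
use warnings;
use utf8;    # the string literal below is read as decoded characters

# Hypothetical helper: the same substitution as the line named in the workaround.
sub underscore_nonword {
    my ($anchor) = @_;
    $anchor =~ s/\W/_/g;
    return $anchor;
}

my $chars = "日本語 name";    # decoded character string
my $bytes = $chars;
utf8::encode($bytes);         # the same text as raw UTF-8 octets

# Byte string: every high-bit octet counts as a non-word character.
my $from_bytes = underscore_nonword($bytes);

# Character string: the ideographs are word characters, only the space changes.
my $from_chars = underscore_nonword($chars);
```

With the byte string, all nine octets of the three ideographs plus the space become underscores. With the decoded string, only the space does.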

Perl Info
---
Flags:
category=
severity=
---
Site configuration information for perl 5.10.0:

Configured by Debian Project at Fri Jun 26 18:43:11 UTC 2009.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
Platform:
osname=linux, osvers=2.6.24-23-server, archname=i486-linux-gnu-thread-multi
uname='linux rothera 2.6.24-23-server #1 smp wed apr 1 22:22:14 utc 2009 i686 gnulinux '
config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/lib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.0 -Dsitearch=/usr/local/lib/perl/5.10.0 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.0 -Dd_dosuid -des'
hint=recommended, useposix=true, d_sigaction=define
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=undef, use64bitall=undef, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O2 -g',
cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
ccversion='', gccversion='4.3.3', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='cc', ldflags =' -L/usr/local/lib'
libpth=/usr/local/lib /lib /usr/lib /usr/lib64
libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
perllibs=-ldl -lm -lpthread -lc -lcrypt
libc=/lib/libc-2.9.so, so=so, useshrplib=true, libperl=libperl.so.5.10.0
gnulibc_version='2.9'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib'

Locally applied patches:


---
@INC for perl 5.10.0:
/etc/perl
/usr/local/lib/perl/5.10.0
/usr/local/share/perl/5.10.0
/usr/lib/perl5
/usr/share/perl5
/usr/lib/perl/5.10
/usr/share/perl/5.10
/usr/local/lib/site_perl
.

---
Environment for perl 5.10.0:
HOME=/home/paulh
LANG=ja_JP.UTF-8
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=/home/paulh/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/home/paulh/bin:/home/paulh/cvs/cbrc/C++/noncppUtils:/home/paulh/cvs/utils/searchTools:/home/paulh/cvs/cbrc/utils:/home/paulh/cvs/paulh/paulScripts
PERL_BADLANG (unset)
SHELL=/bin/bash


p5pRT commented Mar 27, 2011

From @rjbs

There's a Google Summer of Code proposal to fix this, along with many
other Pod​::* problems in core. I suggest we not worry about this bug
until the fate of that proposal is clear.


p5pRT commented Mar 27, 2011

@rjbs - Status changed from 'new' to 'open'


p5pRT commented Nov 27, 2011

From @cpansprout

On Sat Mar 26 02:56:26 2011, horton-p@aist.go.jp wrote:

> Background: Tom Christiansen asked me to report the following as
> a bug, in response to an email from me.
>
> Symptom: When running pod2html on files with non-ascii characters in
> link anchor names, those characters are replaced with underscores.
>
> Workaround: In Pod/Html.pm, comment out line containing
> "$anchor =~ s/\W/_/g;"
> This is line 2066 in Pod::Html Version 1.08.

On Sun Mar 27 06:22:13 2011, rjbs wrote:

> There's a Google Summer of Code proposal to fix this, along with many
> other Pod::* problems in core. I suggest we not worry about this bug
> until the fate of that proposal is clear.

This problem still exists in blead, though I’m unsure whether it is
really a problem, as such anchors are for Pod​::Html itself to use, no?

--

Father Chrysostomos


p5pRT commented Mar 30, 2012

From @nwc10

Pod::Html has this:

  use locale; # make \w work right in non-ASCII lands

It was added in 1998 by this commit:

commit 3ec0728
Author: Fyodor Krasnov <fyodor@aha.ru>
Date: Tue Nov 24 22:00:36 1998 +0300

  Pod::Html and Pod::Text were not locale-savvy:
  for example in =head1 all non-ASCII-\w-runs were
  turned into underscores in NAME tags. This could
  result in several NAME tags becoming identical.
  Reported by:
  Subject: pod2html vs Russian Characters
  To: Tom.Christiansen@snn.aha.ru, tchrist@perl.com
  Message-Id: <199811241600.TAA05149@stat.aha.ru>

  p4raw-id: //depot/cfgperl@2435

The code referenced is this:

#
# similar to htmlify, but turns non-alphanumerics into underscores
#
sub anchorify {
  my ($anchor) = @_;
  $anchor = htmlify($anchor);
  $anchor =~ s/\W/_/g;
  return $anchor;
}

At first glance it would seem better to replace that \W with a POSIX character
class or Unicode property that reliably expresses the intent.
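
As a sketch of what such a replacement might look like, the variant below spells out the kept characters with explicit Unicode properties. The name `anchorify_explicit`, the choice of `\p{Alphabetic}` and `\p{Nd}`, and the omission of the `htmlify` step are all assumptions made for illustration. It also assumes the input is already a decoded character string.

```perl
use strict;
use warnings;
use utf8;    # the string literal below is read as decoded characters

# Hypothetical variant of anchorify (htmlify step omitted): keep letters,
# decimal digits and underscore, and replace everything else.
sub anchorify_explicit {
    my ($anchor) = @_;
    $anchor =~ s/[^\p{Alphabetic}\p{Nd}_]/_/g;
    return $anchor;
}

# Cyrillic letters survive; the space and both parentheses are replaced.
my $anchor = anchorify_explicit("Имя (v2)");
```

Because the pattern names Unicode properties directly, the result no longer depends on the locale in effect, although it still depends on the input having been decoded.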

However, with the refactor to use Pod::Simple::XHTML, &anchorify is no longer
used by any code within Pod::Html, and the only external user on CPAN* seems
to be installhtml. Hence it's not clear whether a better plan is to deprecate the
function (and similarly htmlify, as it's unused).

Nicholas Clark

* There are several copies and derivatives of Pod::HTML on CPAN - I couldn't
  spot anything using Pod::HTML::anchorify


p5pRT commented Mar 30, 2012

From tchrist@perl.com

Actually, I think \W is the correct thing. Things have changed.
I think we're looking at an outdated processing model.

This should only be an issue now if both of these held true:

(1) they were using byte strings whose high-bit bytes were
  intended to be interpreted in some 8-bit locale

and

(2) there was no =encoding directive.

With everyone moving to UTF-8, or else giving an explicit encoding,
this should not come up.

Also, the unicode_strings feature would take care of the matter.
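
The effect of the unicode_strings feature on \W can be sketched as follows. This assumes Perl 5.14 or later, where the feature also applies to regex matching. The helper names `native_rules` and `unicode_rules` are hypothetical.

```perl
use strict;
use warnings;
require 5.014;    # version check only; "use 5.014" would enable the feature file-wide

# Hypothetical helper: substitution compiled without unicode_strings.
sub native_rules {
    my ($s) = @_;
    $s =~ s/\W/_/g;
    return $s;
}

# Hypothetical helper: the same substitution compiled with unicode_strings.
sub unicode_rules {
    use feature 'unicode_strings';
    my ($s) = @_;
    $s =~ s/\W/_/g;
    return $s;
}

my $latin1 = "caf\xE9";    # 8-bit string, not UTF-8 encoded internally

my $native  = native_rules($latin1);     # the \xE9 character is replaced
my $unicode = unicode_rules($latin1);    # the \xE9 character is kept
```

Under the feature, the character 0xE9 is treated as the code point U+00E9 and so counts as a word character, even though the string was never upgraded to Perl's internal UTF-8 form.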

> However, with the refactor to use Pod::Simple::XHTML &anchorify is no longer
> used by any code within Pod::Html, and the only external user on CPAN* seems
> to be installhtml. Hence it's not clear if a better plan is to deprecate the
> function. (And similarly htmlify, as it's unused)
>
> Nicholas Clark
>
> * There are several copies and derivatives of Pod::HTML on CPAN - I couldn't
>   spot anything using Pod::HTML::anchorify

There were other pod2html issues involving Unicode in v5.14, but I think
the current version fixed those.

I don't think we should support "guessed" encodings.

--tom


p5pRT commented Mar 30, 2012

The RT System itself - Status changed from 'new' to 'open'


p5pRT commented Mar 30, 2012

From @cpansprout

On Fri Mar 30 09:03:57 2012, tom christiansen wrote:

> Nicholas Clark (via RT) <perlbug-followup@perl.org> wrote
> on Fri, 30 Mar 2012 08:54:42 PDT:
>
>> At first glance it would seem better to replace that \W with a
>> POSIX character class or Unicode property that reliably
>> expresses the intent.
>
> Actually, I think \W is the correct thing.

If you want to follow HTML 5, it’s actually [^ \t\n\f\r], which I
believe is equivalent to (?aa:\s).

There is nothing preventing anyone from having an anchor named #@>$^,
except that it would have to be referenced as #%23%40%3E%24%5E; so, for
convenience, we might want to use (?:[A-Za-z0-9._~-]|[^\0-\x7f]).
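
A minimal sketch of both points: the first helper keeps exactly the characters matched by the suggested class and replaces the rest, and the second shows where the percent-encoded form of #@>$^ comes from. The names `anchorify_unreserved` and `percent_encode_ascii` are hypothetical, and the encoder handles ASCII input only; non-ASCII text would first need encoding to UTF-8 octets.

```perl
use strict;
use warnings;

# Hypothetical helper: keep what (?:[A-Za-z0-9._~-]|[^\0-\x7f]) matches,
# that is, unreserved ASCII characters plus anything outside ASCII, and
# replace every other ASCII character with an underscore.
sub anchorify_unreserved {
    my ($anchor) = @_;
    $anchor =~ s/(?![A-Za-z0-9._~-])[\x00-\x7f]/_/g;
    return $anchor;
}

# Hypothetical helper: percent-encode an ASCII-only fragment name.
sub percent_encode_ascii {
    my ($fragment) = @_;
    $fragment =~ s/([^A-Za-z0-9._~-])/sprintf('%%%02X', ord($1))/ge;
    return $fragment;
}

my $encoded = percent_encode_ascii('#@>$^');    # %23%40%3E%24%5E
my $kept    = anchorify_unreserved("na\x{EF}ve name?");
```

In the second call the character U+00EF is kept because it lies outside ASCII, while the space and the question mark are replaced.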

> There were other pod2html issues involving Unicode in v5.14, but I think
> the current version fixed those.
>
> I don't think we should support "guessed" encodings.

I think it was a mistake for such ever to have been supported by anything.

Father Chrysostomos


toddr commented Feb 5, 2020

> From @rjbs
>
> There's a Google Summer of Code proposal to fix this, along with many
> other Pod::* problems in core. I suggest we not worry about this bug
> until the fate of that proposal is clear.

@rjbs what happened with the GSOC proposal?


rjbs commented Feb 6, 2020

It became a project, but only a subset of it was ever delivered, as I recall. It's been nine years…

@jkeenan jkeenan added the ext/Pod-Html issues in the blead-upstream Pod-Html distribution label Jan 30, 2021

jkeenan commented Jan 31, 2021

> From horton-p@aist.go.jp
>
> Background: Tom Christiansen asked me to report the following as
> a bug, in response to an email from me.
>
> Symptom: When running pod2html on files with non-ascii characters in link anchor names,
> those characters are replaced with underscores.
>
> Workaround: In Pod/Html.pm, comment out line containing "$anchor =~ s/\W/_/g;"
> This is line 2066 in Pod::Html Version 1.08.
> Perl Info

If I read the discussion in this ticket correctly -- particularly Tom's remark, there's nothing to be done in this ticket and the ticket should be closed.

Is my reading correct?

Thank you very much.
Jim Keenan

@jkeenan jkeenan self-assigned this Jan 31, 2021
@jkeenan jkeenan added Closable? We might be able to close this ticket, but we need to check with the reporter installhtml Problems with 'installhtml' program or 'make' target and removed affects-5.10 labels Jan 31, 2021

jkeenan commented Feb 8, 2021

>> From horton-p@aist.go.jp
>>
>> Background: Tom Christiansen asked me to report the following as
>> a bug, in response to an email from me.
>> Symptom: When running pod2html on files with non-ascii characters in link anchor names,
>> those characters are replaced with underscores.
>> Workaround: In Pod/Html.pm, comment out line containing "$anchor =~ s/\W/_/g;"
>> This is line 2066 in Pod::Html Version 1.08.
>> Perl Info
>
> If I read the discussion in this ticket correctly -- particularly Tom's remark, there's nothing to be done in this ticket and the ticket should be closed.
>
> Is my reading correct?
>
> Thank you very much.
> Jim Keenan

No one has argued that there is something remaining to be done in this ticket. Accordingly, closing.

Thank you very much.
Jim Keenan

@jkeenan jkeenan closed this as completed Feb 8, 2021
@jkeenan jkeenan removed the Closable? We might be able to close this ticket, but we need to check with the reporter label Feb 8, 2021