ß and pod2man #8826

Closed
p5pRT opened this issue Mar 7, 2007 · 11 comments

Comments

@p5pRT

p5pRT commented Mar 7, 2007

Migrated from rt.perl.org#41737 (status was 'open')

Searchable as RT41737$

@p5pRT
Author

p5pRT commented Mar 7, 2007

From @Abigail

Created by @Abigail

While documenting the quirks of Perl's behaviour when it comes to
matching letters in the upper half of ISO-Latin-1 with \w,
I use an example. In this case, "ß" (that is, a German ringel-s
in ISO-Latin-1).

If I run this through pod2man, the "ß" translates into "\*8", which
gets formatted as "ss". This is wrong; it should leave the "ß" as is,
both in verbatim and in non-verbatim text.

pod2text and pod2html do output "ß". And so does pod2man in 5.8.8.

Perl Info

Flags:
    category=utilities
    severity=high

Site configuration information for perl 5.9.5:

Configured by abigail at Sat Mar  3 09:08:55 CET 2007.

Summary of my perl5 (revision 5 version 9 subversion 5 patch 30445) configuration:
  Platform:
    osname=linux, osvers=2.6.11-1.1369_fc4smp, archname=i686-linux-64int-ld
    uname='linux almanda 2.6.11-1.1369_fc4smp #1 smp thu jun 2 23:08:39 edt 2005 i686 i686 i386 gnulinux '
    config_args='-des -Dusedevel -Uversiononly -Dmydomain=.abigail.be -Dcf_email=abigail@abigail.be -Dperladmin=abigail@abigail.be -Doptimize=-g -Dcc=gcc -Dprefix=/opt/perl/current -Dusemorebits'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=undef, uselongdouble=define
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-DDEBUGGING -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-g',
    cppflags='-DDEBUGGING -fno-strict-aliasing -pipe -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='4.0.2 20051125 (Red Hat 4.0.2-8)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long long', ivsize=8, nvtype='long double', nvsize=12, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.3.5.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.3.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    DEVEL


@INC for perl 5.9.5:
    /home/abigail/Perl
    /opt/perl/current/lib/5.9.5/i686-linux-64int-ld
    /opt/perl/current/lib/5.9.5
    /opt/perl/current/lib/site_perl/5.9.5/i686-linux-64int-ld
    /opt/perl/current/lib/site_perl/5.9.5
    .


Environment for perl 5.9.5:
    HOME=/home/abigail
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH=/home/abigail/Lib:/usr/local/lib:/usr/lib:/lib:/usr/X11R6/lib
    LOGDIR (unset)
    PATH=/home/abigail/Bin:/opt/perl/bin:/usr/local/bin:/usr/local/X11/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/X11R6/bin:/usr/games:/usr/share/texmf/bin:/opt/Acrobat/bin:/opt/java/blackdown/j2sdk1.3.1/bin:/usr/local/games/bin
    PERL5LIB=/home/abigail/Perl
    PERLDIR=/opt/perl
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Author

p5pRT commented Mar 8, 2007

From @andk

On Wed, 07 Mar 2007 13:52:37 -0800, Abigail (via RT) <perlbug-followup@perl.org> said:

  > # New Ticket Created by Abigail
  > # Please include the string: [perl #41737]
  > # in the subject line of all future correspondence about this issue.
  > # <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=41737 >

  > This is a bug report for perl from abigail@abigail.be,
  > generated with the help of perlbug 1.35 running under perl 5.9.5.

  > -----------------------------------------------------------------
  > [Please enter your report here]

  > While documenting the quirks of Perls behaviour when it comes to
  > matching letters in the upper half of the ISO-Latin with \w,
  > I use an example. In this case, "ß" (that is, a German ringel-s
  > in ISO-Latin1.)

  > If I run this through pod2man, the "ß" translates into "\*8", which
  > gets formatted as "ss". This is wrong; it should leave the "ß" as is,
  > both in verbatim and in non-verbatim text.

  > pod2text and pod2html do output "ß". And so does pod2man in 5.8.8.

On my box it isn't "\*8" but "A\*X" (rendered as "AX") but this is
most probably the same bug. Introduced around 26292 which marks the
inclusion of podlators-2.00.

For completeness, here is the output of my binary search run:

  ----Program----

  =head1 [perl #41737] ß and pod2man

  pod2text and pod2html do output "ß". And so does pod2man in 5.8.8.

  =cut

  use File::Spec;
  my $perl = File::Spec->rel2abs($^X);

  use File::Basename;
  my $pod2man = File::Basename::dirname($perl) . "/pod2man";

  open my $fh, "-|", $perl, $pod2man, $0 or die;
  while (<$fh>) {
      next unless /pod2text/;
      print;
  }

  ----Output of .../pJmPzPp/perl-5.8.0@​26291/bin/perl----
  \& pod2text and pod2html do output "ß". And so does pod2man in 5.8.8.

  ----EOF ($?='0')----
  ----Output of .../pe53taG/perl-5.8.0@​26295/bin/perl----
  \& pod2text and pod2html do output "A\*~X". And so does pod2man in 5.8.8.

  ----EOF ($?='0')----
  Need a perl between 26291 and 26295
  (but 26292, 26293, 26294 could not successfully be used to build perl)
  No useable patch available between 26291 and 26295
  Patches 26292, 26293, 26294 could not successfully be used to build perl

Change 26292 by stevep@​stevep-mccoy on 2005/12/07 12​:36​:59

  Upgrade to podlators-2.00

--
andreas

@p5pRT
Author

p5pRT commented Mar 8, 2007

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Author

p5pRT commented Mar 8, 2007

From rra@stanford.edu

Andreas J Koenig <andreas.koenig.7os6VVqR@franz.ak.mind.de> writes:

  > -----------------------------------------------------------------
  > [Please enter your report here]

  > While documenting the quirks of Perls behaviour when it comes to
  > matching letters in the upper half of the ISO-Latin with \w,
  > I use an example. In this case, "ß" (that is, a German ringel-s
  > in ISO-Latin1.)

  > If I run this through pod2man, the "ß" translates into "\*8", which
  > gets formatted as "ss". This is wrong; it should leave the "ß" as is,
  > both in verbatim and in non-verbatim text.

  > pod2text and pod2html do output "ß". And so does pod2man in 5.8.8.

Older versions of pod2man had a bug that caused it to not translate
characters in verbatim blocks such as this one. The current version of
pod2man is consistent; that behavior was never intentional.

Now, unfortunately for this bug report, what it's consistent on is always
translating high-bit characters into something that traditional nroff can
cope with. Including high-bit characters verbatim in man pages can cause
all sorts of bizarre behavior from older versions of nroff. Pod​::Man has
always been conservative in this regard (even from before I rewrote Todd's
original version), modulo the occasional bug.

The root problem here is that it's very unclear what to output if one
isn't going to stick to a strict ASCII encoding. Do I output characters
in the user's native locale and just hope everything can cope? (What
happens if the locale is C? What happens if the man pages are installed
by someone with a locale of en_US.UTF-8 and then later viewed by someone
with a locale of C?) Do I use groff-specific escapes that produce the
correct characters? Do I always output UTF-8? There appears to be no
correct answer; this whole area appears to be hopelessly confused. I was
planning on at least adding a flag that says "assume groff," but according
to the latest information I've heard, groff's UTF-8 support still isn't
something that can be relied upon.

If someone can tackle putting together a complete plan for how to handle
non-ASCII characters in Pod​::Man, I'm happy to look at it, probably bless
it, and possibly even implement it, but designing it properly with lots of
testing on different Unix platforms is more than I have time for
immediately.

  > On my box it isn't "\*8" but "A\*X" (rendered as "AX") but this is
  > most probably the same bug.

No, this is a testing bug. You're missing an =encoding command, the file
is apparently written in UTF-8, and in the absence of any =encoding,
Pod​::Simple apparently decided that your locale was ISO-8859-1. This is
exactly the sort of problem that worries me about doing anything but the
most conservative thing for output from Pod​::Man.

--
Russ Allbery (rra@​stanford.edu) <http​://www.eyrie.org/~eagle/>
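
The mis-decoding described in this comment can be reproduced outside Pod::Simple. What follows is a minimal sketch, assuming only the core Encode module; the variable names are illustrative, and this is not a description of how Pod::Simple itself guesses an encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes that encode "ß" (U+00DF) in UTF-8.
my $bytes = "\xC3\x9F";

# Decoded as UTF-8, the two bytes form a single character.
my $as_utf8 = decode('UTF-8', $bytes);

# Decoded as ISO-8859-1, each byte becomes its own character,
# so a formatter sees U+00C3 (A with tilde) followed by U+009F.
my $as_latin1 = decode('ISO-8859-1', $bytes);

# Print the length and code points of each interpretation.
for my $pair (['UTF-8', $as_utf8], ['ISO-8859-1', $as_latin1]) {
    my ($label, $text) = @$pair;
    my @points = map { sprintf('U+%04X', ord($_)) } split(//, $text);
    printf("%-10s %d char(s): %s\n", $label, length($text), "@points");
}
```

The A with tilde in the second interpretation looks consistent with the "A\*~" in the bisect output above. Declaring `=encoding UTF-8` at the top of the POD, as suggested, avoids the guess.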

@p5pRT
Author

p5pRT commented Mar 14, 2007

From @Abigail

So, how do I solve my problem (using a non-ASCII, Latin-1 character in
a POD document)? Using an ASCII character, or a UTF-8 character whose
code point is above 255, isn't an option, as it's an example of Perl's odd
behaviour with characters in the 128-255 range.

Should I fall back to using "\xXX" in the examples, with comments declaring
what the \xXX means?

Abigail

@p5pRT
Author

p5pRT commented Mar 14, 2007

From rra@stanford.edu

Abigail <abigail@abigail.be> writes:

  > So, how do I solve my problem (using a non-ASCII, LATIN-1) character in
  > a POD document? Using an ASCII character, or an UTF-8 character whose
  > code point is above 255 isn't an option as it's an example of Perls odd
  > behaviour of characters in the 128-255 range.

Right now, unfortunately, if you need for that character to be displayed
properly when viewing the man page, it's not possible. pod2man will
attempt to generate nroff that will do something not completely
unreasonable, but it isn't going to preserve the literal character if that
character isn't present in ASCII.

  > Should I fall back to using "\xXX" in the examples, with comments
  > declaring what the \xXX means?

In the short run, I'm not sure I have a better alternative for you. :/

--
Russ Allbery (rra@​stanford.edu) <http​://www.eyrie.org/~eagle/>
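
The "\xXX" fallback discussed in this exchange might look like the following. This is a minimal sketch, not taken from Abigail's actual documentation; the variable names are made up, and the behaviour it demonstrates assumes default regex semantics (no `use locale`, no `unicode_strings` feature, no `/u` or `/a` modifier).

```perl
use strict;
use warnings;

# "\xDF" is LATIN SMALL LETTER SHARP S (U+00DF), written as an escape
# so that the source (and any POD built from it) stays pure ASCII.
my $native   = "\xDF";
my $upgraded = "\xDF";

# Same character, but now stored internally as UTF-8.
utf8::upgrade($upgraded);

# Under default semantics the internal representation decides
# whether a character in the 128-255 range counts as a word character.
my $native_result   = $native   =~ /\w/ ? 'matches' : 'does not match';
my $upgraded_result = $upgraded =~ /\w/ ? 'matches' : 'does not match';

print "native   \\xDF $native_result \\w\n";
print "upgraded \\xDF $upgraded_result \\w\n";
```

Because the example contains no byte above 127, pod2man has nothing to transliterate, at the cost of a comment explaining which character the escape stands for.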

@jkeenan
Contributor

jkeenan commented Feb 22, 2020

The short run appears to be the long run.

$ cat ghi-8826-pod2man.pod
=encoding UTF-8

=head1 Abigail's RT 41737

I use an example. In this case, "ß".

More text.

=cut

$ pod2man ghi-8826-pod2man.pod ghi-8826.man
GHI-8826-POD2MAN(1)           User Contributed Perl Documentation           GHI-8826-POD2MAN(1)

Abigail's RT 41737
       I use an example. In this case, "ss".

       More text.

perl v5.30.0                               2020-02-22                       GHI-8826-POD2MAN(1)

@Abigail @rra

@rra
Contributor

rra commented Feb 22, 2020

The -u flag to the pod2man command, or the utf8 option to the Pod::Man constructor, will tell Pod::Man that it's okay to use UTF-8. You will then need groff or a similar modern *roff implementation to view the man page correctly.

I'm seriously considering changing the default to UTF-8, even though this is likely to cause at least some rendering problems, because it bothers me that we're not rendering people's names properly, and I'm not sure there's a better solution than going through that disruptive change.
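
A minimal sketch of the constructor route, assuming a podlators release that accepts the `utf8` option and the Pod::Simple-style `output_string` and `parse_string_document` methods. The `render` helper and the sample POD are made up for illustration, and whether the returned string holds characters or UTF-8 bytes is treated here as release-dependent.

```perl
use strict;
use warnings;
use Pod::Man;

# Sample POD as UTF-8 bytes; "\xC3\x9F" is the UTF-8 encoding of "ß".
my $pod = join "\n",
    '=encoding UTF-8', '',
    '=head1 NAME', '',
    'example - sharp s test', '',
    '=head1 DESCRIPTION', '',
    "In this case, \"\xC3\x9F\".", '',
    '=cut', '';

# Hypothetical helper: format $pod with the given Pod::Man options
# and return the generated *roff source as a string.
sub render {
    my (%opts) = @_;
    my $parser = Pod::Man->new(%opts);
    my $roff   = '';
    $parser->output_string(\$roff);
    $parser->parse_string_document($pod);
    return $roff;
}

my $roff = render(utf8 => 1, name => 'EXAMPLE', section => 1);

# Accept either representation of the character (bytes or characters).
my $kept = $roff =~ /\xC3\x9F|\x{DF}/;
my $message = $kept ? "sharp s preserved\n" : "sharp s was transliterated\n";
print $message;
```

On the command line, the equivalent mentioned above is `pod2man -u`.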

@jkeenan
Contributor

jkeenan commented Jan 30, 2021

@rra, have you given any further thought to this question?

(I can confirm that pod2man -u gives the results @Abigail thought desirable -- at least on Linux and FreeBSD.)

Thank you very much.
Jim Keenan

@khwilliamson
Contributor

No response from @rra in a year; I have confirmed that -u does the right thing. Closing.

@rra
Contributor

rra commented Apr 14, 2022

For the record, I do still intend to fix this but the last couple of years have not been good for free time to spend on podlators. However, the next release will almost certainly make Unicode output the default, which I think is the correct resolution of this long-standing bug at this point.
