


catdoc(1)                                               catdoc(1)


NAME
       catdoc  - reads MS-Word file and puts its content as plain
       text on standard output

SYNOPSIS
       catdoc [-ubtawx] [-m number] [ -s charset] [  -d  charset]
       file


DESCRIPTION
       catdoc  behaves much like cat(1) but it reads MS-Word file
       and  produces  human-readable  text  on  standard  output.
       Optionally  it can use latex(1) escape sequenses for char-
       acters which have specail  meaning  for  LaTeX.   It  also
       makes some effort to recognize MS-Word tables, although it
       never tries to write correct  headers  for  LaTeX  tabular
       environment.  Additional  output formats, such is HTML can
       be easily defined.

       catdoc doesn't attempt to extract  formatting  information
       other than tables from MS-Word document, so different out-
       put modes means mainly that different charachers should be
       escaped  and  different ways used to represent characters,
       missing from output charset.  See  CHARACTER  SUBSTITUTION
       below


       catdoc uses internal unicode(4) representation of text, so
       it is able to convert texts when charset in  source  docu-
       ment  doesn't match charset on target system.  See CHARAC-
       TER SETS below.

       If no file names supplied, catdoc processes  its  standard
       input  unless it is terminal. It is unlikely that somebody
       could type Word  document  from  keyboard,  so  if  catdoc
       invoked  without arguments and stdin is not redirected, it
       prints brief usage message and exits.  Processing of stan-
       dard  input  (even  among other files) can be forced using
       dash '-' as file name.

       By default, catdoc wraps lines  which  are  more  than  72
       chars  long  and separates paragraphs by blank lines. This
       behavoir can be turned of by -w switch. In wide mode  cat-
       doc  prints  each paragraph as one long line, suitable for
       import into word processors which  perform  word  wrapping
       theirselves.



OPTIONS
       -a      -  shortcut  for  -f ascii. Produces ASCII text as
               output.  Separates table columns with TAB

       -b      - process broken MS-Word  file.  Normally,  catdoc



MS-Word reader             Version 0.90                         1





catdoc(1)                                               catdoc(1)


               checks  if  first 8 bytes of file is Microsoft OLE
               signature. If so, it processes file, otherwise  it
               just  copies  it  to  stdin. It is intended to use
               catdoc as filter for viewing all files  with  .doc
               extension.

       -dcharset
               - specifies destination charset name. Charset file
               has format described in CHARACTER SETS  below  and
               should  have  .txt extension  and reside in catdoc
               library  directory  (normally  /usr/local/lib/cat-
               doc).

       -fformat
               -  specifies output format as described in CHARAC-
               TER SUBSTITUTION below.   catdoc  comes  with  two
               output  formats  - ascii and tex. You can add your
               own if you wish.

       -mnumber
               Specifies right margin for text  (default 72).  -m
               0 is equivalent to -w

       -scharset
               Specifies  source charset. (one used in Word docu-
               ment), if Word  document  doesn't  contain  UTF-16
               text.

       -t      - shortcut for -f tex
                converts  all printable chars, which have special
               meaning  for  LaTeX(1)  into  appropriate  control
               sequences. Separates table columns by &.

       -u      -  declares  that Word  document  contain  UNICODE
               (UTF-16) represntation of text  (as  some  Word-97
               documents). If catdoc fails to correct  Word docu-
               ment with  default charset,   try    this  option.

       -w      disables  word  wrapping. By default catdoc output
               is splitted into lines  not  longer  than  72  (or
               number,  specified by -m  option)   characters and
               paragraphs are separated by blank line. With  this
               option each paragraph is one long line.

       -x      causes  catdoc to output unknown UNICODE characher
               as NN, instead of question marks.


CHARACTER SETS
       When processing MS-Word file catdoc uses information about
       two character sets, typically different
        -   input and output. They are stored in plain text files
       in catdoc library directory. Character  set  files  should
       contain  two  whitespace-separated  hexadecimal  numbers -



MS-Word reader             Version 0.90                         2





catdoc(1)                                               catdoc(1)


       8-bit code in character set and 16-bit unicode code.  Any-
       thing from hash mark to end of line is ignored, as well as
       blank lines.

       catdoc distribution includes some of these character sets.
       Additional  character  set definitions, directly usable by
       catdoc can be obtained from ftp.unicode.org. Charset files
       have .txt suffix, which shouldn't be specified in command-
       line or configuration files.

CHARACTER SUBSTITUTION
       catdoc converts  MS-Word file into following internal uni-
       code representation:

       1. Paragraphs are separated by ASCII Line Feed symbol
           (0x000A)

       2. Table cells within row are separated by ASCII Field
           Separator symbol
           (0x001C)

       3. Table rows are separated by ASCII Record Separator
           (0x001E)

       4. All printable characters, including whitespace are rep-
           resented with their
           respective UNICODE codes.

       This  UNICODE  representation  is  subsequentely converted
       into 8-bit text in target character  set  using  following
       four-step algorithm:

       1. List of special characters is searched for given
           unicode char- acter.
           If found, then appropriate multi-character sequence is
           output instead of character.

       2. If there is an equivalent in target character set, it
           is  out- put.

       3. Otherwise, replacement list  is  searched  and,  if
           there  is multi-character
           substitution for this UNICODE char, it is output.

       4. If all above fails, "Unknown char" symbol (question
           mark)  is output.

       Lists of special characters and list of  substitution  are
       character set-independent, becouse special chars should be
       escaped regardless of their existense in target  character
       set   (usially,  they are parts of US-ASCII, and therefore
       exist in  any  character  set)  and  replacement  list  is
       searched only for those characters, which are not found in
       target character set.



MS-Word reader             Version 0.90                         3





catdoc(1)                                               catdoc(1)


       These lists are stored  in  catdoc  library  directory  in
       files with prefix of format name. These files have follow-
       ing format:

       Each line can be either comment (starting with hash  mark)
       or contain hexadecimal UNICODE value, separated by whites-
       pace from string, which would be  substituted  instead  of
       it.  If string contain no whitespace it can be used as is,
       otherwise it  should  be  enclosed  in  single  or  double
       quotes.  Usial  backslash  sequences like '\n','\t' can be
       used in these string.



RUNTIME CONFIGURATION
       Upon startup catdoc reads  its  system-wide  configuration
       file  (  catdocrc  in  catdoc  library directory) and then
       user-specific configuration file ${HOME}/.catdocrc.

       These files can contain following directives:

       source_charset = charset-name
               Sets default source charset, which would  be  used
               if  no  -s option specified. Consult configuration
               of nearby windows  workstation  to  find  one  you
               need.

       target_charset = charset-name
                Sets  default  output charset. You probably know,
               which one you use.

       charset_path = directory-list
               colon-separated list  of  directories,  which  are
               searched  for  charset  files.  This allows you to
               install additional charsets in  your  home  direc-
               tory.

       map_path = directory-list
               colon-separated  list  of  directories,  which are
               searched for special character map and replacement
               map.

       format = format name
               Output  format  which  would  be  used by default.
               catdoc comes with two formats - ascii and tex  but
               nothing  prevents you from writing your own format
               (set two map files -  special  character  map  and
               replacement map).

       unknown_char = character specification
               sets  characher  to output instead of unknown uni-
               code character (default '?')  Character specifica-
               tion can have one of two form - character enclosed
               in single quotes or hexadecimal code.



MS-Word reader             Version 0.90                         4





catdoc(1)                                               catdoc(1)


BUGS
       Can produce garbage, if file  contain  embedded  illustra-
       tions.  Doesn't  handle  fast-saves properly. Prints foot-
       notes as separate paragraphs at the end of  file,  instead
       of  producing  correct  latex commands. Cannot distinguish
       between empty table cell and end of table row.




SEE ALSO
       cat(1), strings(1), utf(4), unicode(4)


AUTHOR
       V.B.Wagner <vitus@ice.ru>









































MS-Word reader             Version 0.90                         5


