This file belongs to the CEP package | Ten plik nale/zy do pakietu CEP
This package is public domain        | Pakiet stanowi dobro powszechne
For more info see `0CEP_LIC.ENG'     | Wi/ecej informacji w ,,0CEP_LIC.POL''
===========================================================================
`CEPCOP_E.INF' -- ENGLISH DOCUMENTATION

The amount of disk space occupied by bitmap graphics is a well-recognized
problem. For example, a 300 dpi A4 picture contains ca. 8 700 000 pixels;
assuming that each CMYK pixel occupies four bytes, one obtains ca. 35 MB
of disk space needed to store the picture.

Now, imagine a poor TeX-er who is not allowed to use binary graphic data
(because of the otherwise magnificent DVIPS). The poor TeX-er usually
converts the binary data to hexadecimal EPSes, thus doubling the required
space. Next, after the document is compiled with TeX+DVIPS, all the graphic
data is copied into the resulting PS file, so the required space is doubled
again: altogether 140 MB per A4 page. The nightmare begins...
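
The arithmetic above can be checked with a short sketch. It is written in
Python purely for illustration (the package itself consists of AWK and
PostScript programs); the A4 dimensions of 210 mm x 297 mm and the reading
of 1 MB as 10^6 bytes are assumptions not stated in the text.

```python
# Assumptions (not from the package): A4 is 210 mm x 297 mm,
# one inch is 25.4 mm, and 1 MB means 10**6 bytes.
A4_MM = (210, 297)
DPI = 300
BYTES_PER_CMYK_PIXEL = 4


def a4_pixels(dpi=DPI):
    """Number of pixels on an A4 page at the given resolution."""
    width = round(A4_MM[0] / 25.4 * dpi)
    height = round(A4_MM[1] / 25.4 * dpi)
    return width * height


pixels = a4_pixels()
binary_bytes = pixels * BYTES_PER_CMYK_PIXEL  # raw CMYK data
hex_eps_bytes = 2 * binary_bytes              # two hex digits per byte
total_bytes = 2 * hex_eps_bytes               # EPS plus its copy in the PS file

print(f"pixels:       {pixels}")
print(f"binary data:  {binary_bytes / 1e6:.1f} MB")
print(f"hex EPS:      {hex_eps_bytes / 1e6:.1f} MB")
print(f"EPS + PS:     {total_bytes / 1e6:.1f} MB")
```

Under these assumptions the total comes to roughly 139 MB, consistent with
the rounded figure of 140 MB quoted above.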

This problem is not new; it was recognized by Adobe a relatively long
time ago. In the Level 2 specification they included objects called filters
which enable data compression. In particular, instead of hexadecimal data one
can use ASCII85 encoding (similar to the Unix utilities uuencode and
uudecode), run-length compression, LZW compression, DCT (used in JPEG
compressed files), and others. Why not make use of these tools? The question
is not as silly as it may look at first glance, as there exist relatively few
applications generating well-compressed PostScript graphics.

We decided to fill this gap somehow. We developed a little package enabling
the compression of ``normal'' (non-compressed) graphic data. The nature of
the problem is more complex, however, than one might expect. In particular,
a universal, always efficient compression technique does not exist. Hence the
package has several ``buttons'' which enable controlling various aspects of
compression.
                                *   *   *

Our package consists of four AWK programs, CEP.AWK-UNCEP.AWK and
COP.AWK-UNCOP.AWK. CEP.AWK and COP.AWK generate (on-the-fly) PostScript
programs which, processed by Ghostscript, yield the appropriate data
compression. UNCEP and UNCOP accomplish (using a similar technique) the
reverse process, i.e., uncompression.

CEP is devised for the compression of usual bitmap EPS files, containing
a single, hexadecimally coded image; COP can be used to compress any
PostScript data.

The question arises: why use two packing techniques?  The answer is
simple: the efficiency of compression is higher if a compressing program
knows in advance which kind of data are to be expected.  In general, bitmaps
are more regular (redundant) than arbitrary PostScript data, hence even
simple algorithms turn out to be more efficient.

Tests show that in the best case (screen dumps) squeezing a file down to 10%
of the original size is nothing unusual. Sometimes, however, no compression
method gives a satisfactory result. In such a case, one can always encode the
data using the ASCII85 filter, reducing the size of a hexadecimal bitmap by
approximately 35%.
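
The gain from ASCII85 follows from the encoding ratios: hexadecimal coding
spends two characters per byte, ASCII85 five characters per four bytes.
The sketch below is only an illustration; it uses Python's standard
`base64.a85encode' rather than the Ghostscript filter, the sample data are
hypothetical, and line breaks and the end-of-data mark are ignored.

```python
import base64


def encoded_sizes(data):
    """Return (hex_size, ascii85_size) in characters for the given bytes."""
    hex_size = len(data.hex())                # 2 characters per byte
    a85_size = len(base64.a85encode(data))    # 5 characters per 4 bytes
    return hex_size, a85_size


# Hypothetical sample: 4000 bytes without zero bytes, so that the
# ASCII85 shortcut `z' (four zero bytes) never applies.
sample = bytes(range(1, 251)) * 16

hex_size, a85_size = encoded_sizes(sample)
saving = 1 - a85_size / hex_size

print(f"hex: {hex_size}, ASCII85: {a85_size}, saving: {saving:.1%}")
```

In this idealized case the saving is 37.5%; line breaks and the remaining,
unencoded parts of a real EPS file bring it closer to the approximate figure
given above.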

Below we give a brief description of CEP and COP. So far, only the MS DOS
version of the PS-compressors is available. This version uses the
GAWK-EMX.EXE implementation of AWK and the GS386.EXE Ghostscript interpreter.

We tested the package using several Ghostscript and GAWK implementations;
currently we use Ghostscript 5.10 and GAWK 3.0.3.

==========================================================================
                          C E P   AND   U N C E P
==========================================================================

The CEP subpackage consists of the MS DOS batch files CEP.BAT and UNCEP.BAT
and the AWK programs CEP.AWK and UNCEP.AWK. First, AWK inspects the source
EPS file, doing its best to locate the hexadecimal bitmap; next, it creates
an appropriate PostScript program; and then control is passed on to
Ghostscript, which just executes the submitted program: it encodes the
bitmap and copies the remaining lines verbatim. The original preamble is
slightly modified; nevertheless, all DSC comments are left intact.

If the bitmap cannot be found or AWK suspects that trouble may arise,
the CEP engine gives up.

The resulting file should be verified prior to removing the original one,
as the CEP heuristic tricks may fail to locate the bitmap properly; moreover,
due to GS bugs, removing the source prematurely may also be painful.

CEP never generates binary output -- only hexadecimal and ASCII85 encodings
are supported. This is because CEP-compressed EPS files are primarily meant
to be used in the context of TeX+DVIPS. Nevertheless, the resulting files
can be used in other typesetting systems as so-called placeable EPSes.
The applicability to non-TeX applications, however, is somewhat limited,
as binary TIFF previews may be misinterpreted by (G)AWK.

UNCEP requires that a CEP-compressed file has not been changed. In particular,
it relies on the information in a quasi-DSC comment `%UNCEPInfo:'. This
information can be destroyed by a seemingly innocent modification (e.g., by
adding or removing a comment line). Note that the technique employed by CEP
destroys, by its nature, the information about the line-breaking structure
of the hexadecimal bitmap. Therefore, UNCEP cannot restore the original file
exactly. The line-breaking structure poses no problem for a PostScript
interpreter. There exist programs, however, that read their own bitmap EPS
files and for unknown reasons make use of such (sub)lexical information;
Aldus PhotoStyler is a notable example.

CEP USAGE: cep.bat input_file output_file [options]
           the program recognizes the following options:
                   8 -- use ASCII85 coding (default)
              h or H -- use HEX (hexadecimal) coding
              r or R -- use RLE (RunLength) compression (default)
              l or L -- use LZW compression
              f or F -- use Flate compression (non-standard!)
              n or N -- don't compress
NOTE: names of input_file and output_file must differ.

UNCEP USAGE: uncep.bat input_file output_file
NOTE: names of input_file and output_file must differ;
      decompression and decoding method is taken from input file.

==========================================================================
                         C O P   AND   U N C O P
==========================================================================

The subpackage consists of the MS DOS batch files COP.BAT and UNCOP.BAT, and
the AWK programs COP.AWK and UNCOP.AWK. COP reads the supplied data and
encodes it appropriately. No analysis of the PostScript data is performed, as
the entire file is encoded without changing even a bit. The only aspect that
is taken into account is the DSC comment `%%BoundingBox:'; if it is found,
COP inserts this comment in the preamble, otherwise the resulting file does
not contain the bounding box information.

COP-generated files are readable by any PostScript Level 2 interpreter.

UNCOP scans the header and deduces from it the method of decompression, hence
no options are needed. UNCOP, unlike UNCEP, retrieves precisely the original
file. It is still recommended, however, that the user verify whether the
resulting file is properly interpreted by GS. Due to GS bugs, removing the
source file prematurely after compression or decompression may turn out
to be painful.

Since COP can be used to compress any data for arbitrary applications, binary
encoding is also allowed. The resulting files can be used in typesetting
systems that accept so-called placeable EPSes. Unfortunately, binary TIFF
previews make the compressed files unreadable for a PostScript interpreter.

COP USAGE: cop.bat input_file output_file [options]
           the program recognizes the following options:
                  8 -- use ASCII85 coding (default)
             b or B -- use binary coding
             h or H -- use HEX (hexadecimal) coding
             r or R -- use RLE (RunLength) compression (default)
             l or L -- use LZW compression
             f or F -- use Flate compression (non-standard!)
             n or N -- don't compress
NOTE: names of input_file and output_file must differ;
      observe that binary encoding is, in fact, no encoding at all.

UNCOP USAGE: uncop.bat input_file output_file
NOTE: names of input_file and output_file must differ;
      decompression and decoding method is taken from input file.

=============================================================================
             A HEAP OF REMARKS CONCERNING  C E P  AND  C O P
=============================================================================

The applied solution addresses several problems:

  * It is not at all obvious how to determine syntactically
    where a hexadecimal bitmap begins in an EPS file; semantic analysis
    (by redefining PostScript primitives image, imagemask and colorimage)
    is possible, but it also has its limitations; anyway, we decided
    to recognize a bitmap syntactically, which raised the problem of
    recognizing such artefacts as `add' or `def', which look like
    fragments of a bitmap but, in fact, are not.

  * Also, it is not obvious which compression method should be applied to
    given data; usually, ASCII85 encoding is advisable; for pure bitmaps
    (CEP) RLE compression is satisfactory, although the LZW and Flate filters
    usually produce much better results (the latter seems to be the best);
    nevertheless, both LZW and Flate encodings have limited usability:
       (a) LZW encoding is not implemented in GS ver. 4.x due to US
           patent law; Aladdin implemented an LZW-compatible filter instead,
           which produces non-compressed data (in fact, enlarged by some 10%)
           readable by any LZWDecode filter. You can use an old GS version,
           or compile a GS version containing the real LZW filter at your
           own risk, but...
       (b) Flate encoding (the same that is used in GZIP) is not available
           (yet?) on PostScript phototypesetters -- in the Ghostscript
           documentation one can find a moderately encouraging passage:
            ``Ghostscript also supports the as yet undocumented
              FlateEncode and FlateDecode filters from PDF 1.2 
              and (presumably) PostScript Level 3''
    As a rule of thumb, we would suggest using ASCII85 encoding without
    any compression for detailed colour photo images. This is simply
    a weakness of all lossless techniques -- the algorithms employed by ARJ,
    ZIP, LHARC, and others would also yield poor results. A reasonable
    alternative for data of this kind would be DCT (JPEG) compression.

  * As was mentioned above, ASCII85 encoding can usually be recommended;
    it brought, however, some trouble. First, due to GS bugs, we decided
    to add the (dummy) NullEncode filter, which seems to cure the problem.
    But there is one more problem: ASCII85-encoded bitmaps may contain
    lines looking like DSC comments, i.e., they may begin with a double
    percent sign, %%, or with a percent sign followed by an exclamation
    mark, %! (why didn't Adobe exclude the percent sign from ASCII85?).
    Some programs may try to interpret such maybe-DSC lines. For example,
    DVIPS just removes such lines unless option -K0 is used; on the other
    hand, leaving DSC comments intact may confuse document managers.

  * It would be convenient to have some more filters implemented, in
    particular DCT and CCITTFax; both of them, however, require some
    additional input data, which makes using them more complex; moreover,
    it is not clear whether one can find the optimal compression parameters
    for DCT without a WYSIWYG program; we are considering the possibility of
    one-to-one conversion between JPEG files and EPS files making use of
    DCT filters; also, a similar conversion between GIF files and EPS files
    making use of LZW filters can perhaps be implemented.

  * The package takes care of the working disk space -- no large temporary 
    files are created; roughly, the needed disk space is equal to the size 
    of the source + the size of the target.

  * In order to check whether a given phototypesetter is a genuine
    PostScript Level 2 interpreter, a trial-and-error method is necessary,
    since many commercial PostScript devices only claim to be Level 2
    compatible. The following file may be helpful for verifying
    the claims of the producer of a PostScript device:
   
        %!PS-Adobe-2.0 EPSF-1.2
        %%Pages: 1
        %%BoundingBox: 0 0 540 150
        %%EndComments
        /Helvetica 8 selectfont
        90 rotate
        1 2 moveto
        (*)
        {0 -10 rmoveto gsave show grestore}
        255 string
        /Filter
        resourceforall
        showpage
        %%EOF
   
    Running this program yields the list of filters for a given device.
    An error reported during the processing of this file indicates that
    the device is not Level 2 compatible. In such a case, the CEP package
    should not be used.

  * bugs and traps:

    (a) Apparently prepending `flushfile' to `closefile' neutralizes an
        error in GS 3.x (tail of output swallowed).

    (b) Adding (a dummy) NullEncode filter neutralizes (probably) another
        GS bug: the ASCII85Encode filter with a target procedure may produce
        superfluous EOD marks, i.e., ~> (if things go really badly you can
        obtain thousands of them). Using a target procedure instead of
        a file object excludes GS ver. < 3.x, because early Ghostscripts
        didn't support all features of PostScript Level 2. Nevertheless,
        GS ver. >= 2.6 can be used for compression with hexadecimal encoding
        (it has the ``legal'' LZW compression).

    (c) The target procedure mentioned in (b), in turn, is needed for the
        special treatment of ASCII85-encoded lines looking like DSC comments:
        such lines are broken after the first percent character. This
        treatment is meant for the DVIPS driver, which has a dangerous
        option `remove comments' (-K1).

    (d) The artificial form of quitting, `{2 2 .quit}' instead of
        `{2 .quit}', is due to an infinite loop in GS 3.5x caused by the
        latter form. The GS internal operator `.quit' was chosen to provide
        error handling at the level of the operating system.

    (e) Still, there exist bugs in older Ghostscripts that we were not able
        to neutralize; e.g., some EPS files are properly compressed
        by GS 2.6, but GS 2.6 breaks while displaying them; GS 3.51
        behaves similarly with other bitmaps. So far, GS 4.x seems to be
        the most resistant to the ``filter trial,'' but it also reveals
        some deficiencies.

    (f) Summing up, we would strongly recommend using GS 4.x or 5.x
        (possibly with LZWEncode compiled in) and GAWK 3.x: GS 4.x is
        a nearly complete implementation of Level 2 PostScript;
        GAWK 3.x provides regular expressions for record separators,
        which makes it possible to handle end-of-lines in exactly
        the same manner as PostScript does and, moreover, is more reliable
        than earlier versions.
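
The remark about ASCII85-encoded lines that resemble DSC comments can be
illustrated with a small sketch. It is written in Python for illustration
only and is not part of the package; the helper `dsc_like_lines' and the
sample data are hypothetical, and the encoding is done by Python's standard
`base64' module, not by Ghostscript.

```python
import base64


def dsc_like_lines(encoded_text):
    """Return the lines that begin with %% or %!, i.e. look like DSC comments."""
    return [line for line in encoded_text.splitlines()
            if line.startswith(("%%", "%!"))]


# Hypothetical worst case: decoding a group of five percent signs gives
# four bytes whose ASCII85 encoding is again five percent signs.
payload = base64.a85decode(b"%%%%%")

# Encode 20 such groups, wrapped into lines of 50 characters.
encoded = base64.a85encode(payload * 20, wrapcol=50).decode("ascii")

suspicious = dsc_like_lines(encoded)
print(len(suspicious), "of", len(encoded.splitlines()),
      "lines look like DSC comments")
```

A check of this kind only detects the dangerous lines; the package itself
deals with them by breaking such lines after the first percent character,
as described in (c).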

==========================================================================
                               H I S T O R Y
==========================================================================

CEP+UNCEP:
   0.10 -- 16.03.97 -- first version
   0.20 -- 05.04.97 -- some obvious bugs removed
   0.30 -- 11.04.97 -- new method of prolog modification (processing complex
                       prologs is enabled), and merging output file in
                       PostScript (faster and less disk space needed)
   0.35 -- 13.04.97 -- comments added (bilingual version)
   0.40 -- 14.04.97 -- significant improvement of performance
   0.50 -- 15.04.97 -- strings allocated statically, temporary files not 
                       created, (speed improved and demand for disk space 
                       slashed)
   0.60 -- 19.04.97 -- PostScript error handling added, some GS bugs
                       neutralized
   0.65 -- 20.04.97 -- exit code added, frame documentation provided
   0.70 -- 21.04.97 -- UNCEP added
   0.75 -- 24.04.97 -- problems of end-of-data and end-of-lines fixed;
                       documentation collected
   1.00 -- 02.05.97 -- public domain release (BachoTeX '97)
   1.03 -- 07.01.98 -- documentation touched, package (CEP.AWK) more robust

COP+UNCOP:
   0.10 -- 06.04.97 -- first version
   0.20 -- 12.04.97 -- program structure unified with CEP, "cvx exec" used 
                       in place of "run"
   0.25 -- 13.04.97 -- comments added (bilingual version)
   0.30 -- 15.04.97 -- strings allocated statically (speed improved)
   0.40 -- 19.04.97 -- PostScript error handling added, some GS bugs
                       neutralized
   0.45 -- 20.04.97 -- exit code added, frame documentation provided
   0.50 -- 24.04.97 -- problems of end-of-lines fixed; documentation collected
   1.00 -- 02.05.97 -- public domain release (BachoTeX '97)
   1.03 -- 07.01.98 -- documentation touched, package (CEP.AWK) more robust

==========================================================================
                            V O C A B U L A R Y
==========================================================================

Ghostscript, GS -- a magnificent interpreter of the PostScript language
         by Aladdin Enterprises, available as a free public license product;
         its current version (4.03) turns out to be much more reliable
         than quite a few commercial interpreters.
AWK   -- a utility and a programming language for convenient and efficient
         batch data-reformatting; written in 1977 by Alfred V. Aho, 
         Peter J. Weinberger, and Brian W. Kernighan.
GAWK  -- GNU AWK, the Free Software Foundation's implementation of AWK,
         written in 1986 by Paul Rubin and Jay Fenlason, with advice
         from Richard Stallman.
GNU   -- The Free Software Foundation (FSF) is a non-profit organization
         dedicated to the production and distribution of freely 
         distributable software, founded by Richard M. Stallman.
TeX   -- public domain typesetting system by Donald E. Knuth of
         Stanford University
DVIPS -- TeX-to-PostScript driver by Tomas Rokicki of Stanford University
DSC   -- Document Structuring Conventions -- Adobe's standard
         for structuring PostScript documents.
ASCII85 -- PostScript algorithm of coding binary data as 7-bit ASCII
         text consisting of only printable characters; encodes every
         four bytes as five characters from `!' to `u'; additionally
         `z' is used to code four zero bytes (see PostScript Language
         Reference Manual, second edition, pp. 128--130)
RLE   -- run length encoding -- a standard method of data compression
         (see PostScript Language Reference Manual, second edition,
         pp. 133--134)
LZW   -- an algorithm of data compression by J. Ziv, A. Lempel (1978),
         improved by T. Welch (1984); Unisys, at the time Welch's employer,
         was granted a US patent in 1985 on Welch's algorithm; a grandfather
         clause was established by Unisys to make pre-1995 implementations 
         of LZW code free of royalty requirements, thereby eliminating such 
         claims on UNIX compress (information after Nelson H. F. Beebe, 
         e-mail beebe@math.utah.edu)
DCT   -- discrete cosine transform compression, an elaborate, very
         efficient but lossy compression scheme
JPEG  -- Joint Photographic Experts Group, an organization responsible
         for developing an international standard for compression of
         image data; the PostScript (Level 2) DCTEncode filter conforms
         to the JPEG-proposed standard.
GZIP  -- compression tool by the GNU Free Software Foundation, based on
         a superior, unpatented compression algorithm, developed in order
         to avoid the patented LZW algorithm.
=======================================================================
END OF `CEPCOP_E.INF'