This file belongs to the CEP package   | Ten plik nale/zy do pakietu CEP
This package is public domain          | Pakiet stanowi dobro powszechne
For more info see `0CEP_LIC.ENG'       | Wi/ecej informacji w ,,0CEP_LIC.POL''
===========================================================================

               `CEPCOP_E.INF' -- ENGLISH DOCUMENTATION

The amount of disk space occupied by bitmap graphics is a well-recognized
problem. For example, a 300 dpi A4 picture contains ca 8 700 000 pixels;
assuming that each CMYK pixel occupies four bytes, one obtains ca 35MB
of disk space needed to store the picture. Now, imagine a poor TeX-er who
is not allowed to use binary graphic data (because of the otherwise
magnificent DVIPS). The poor TeX-er usually converts the binary data to
hexadecimal EPSes, thus doubling the required space. Next, after compiling
a document with TeX+DVIPS, the whole graphic data is put into the
resulting PS file, so the required space is doubled again -- altogether
140MB per A4 page. The nightmare begins...

This problem is not a new one; it was recognized by Adobe a relatively
long time ago. In the Level 2 specification they included objects called
filters, which enable data compression. In particular, instead of
hexadecimal data one can use ASCII85 encoding (similar to the Unix
utilities uuencode and uudecode), run length compression, LZW compression,
DCT (used in JPEG compressed files), and many others.

Why not make use of these tools? The question is not as silly as it may
look at first glance, as there exist relatively few applications
generating well-compressed PostScript graphics. We decided to fill this
gap somehow. We developed a little package enabling the compression of
``normal'' (non-compressed) graphic data. The nature of the problem is
more complex, however, than one might expect. In particular, a universal,
always efficient compression technique does not exist. Hence the package
has several ``buttons'' which enable controlling various aspects of
compression.
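The disk space estimate given in the introduction can be retraced with a
few lines of Python. This is only a sketch of the arithmetic: the function
name `a4_storage_estimate' is made up for illustration, A4 is taken as
210 mm x 297 mm, and 1MB is assumed to mean 1 000 000 bytes.

```python
MM_PER_INCH = 25.4


def a4_storage_estimate(dpi=300, bytes_per_pixel=4):
    """Retrace the A4 disk space estimate (sizes in bytes)."""
    # A4 is 210 mm x 297 mm; convert to whole pixels at the given resolution.
    width_px = round(210 / MM_PER_INCH * dpi)
    height_px = round(297 / MM_PER_INCH * dpi)
    pixels = width_px * height_px

    raw = pixels * bytes_per_pixel   # binary CMYK data, 4 bytes per pixel
    hex_eps = raw * 2                # hexadecimal coding: 2 characters per byte
    ps_copy = hex_eps                # DVIPS copies the bitmap into the PS file
    return {
        "pixels": pixels,
        "raw": raw,
        "hex_eps": hex_eps,
        "eps_plus_ps": hex_eps + ps_copy,
    }


if __name__ == "__main__":
    for name, value in a4_storage_estimate().items():
        print(f"{name:12s} {value:>12d}")
```

The figures come out at roughly 8.7 million pixels, 35MB of raw data,
70MB for the hexadecimal EPS, and 140MB once the PS file is counted too.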
                                 * * *

Our package consists of four AWK programs, the pairs CEP.AWK/UNCEP.AWK
and COP.AWK/UNCOP.AWK. CEP.AWK and COP.AWK generate (on-the-fly)
PostScript programs which, processed by Ghostscript, yield the appropriate
data compression. UNCEP and UNCOP accomplish (using a similar technique)
the reverse process, i.e., uncompression.

CEP is devised for the compression of usual bitmap EPS files, containing
a single, hexadecimally coded image; COP can be used to compress any
PostScript data. The question arises: why use two packing techniques?
The answer is simple: the efficiency of compression is higher if
a compressing program knows in advance which kind of data is to be
expected. In general, bitmaps are more regular (redundant) than arbitrary
PostScript data, hence even simple algorithms turn out to be more
efficient. Tests show that in the best case (screen dumps) squeezing
a file down to 10% of its original size is nothing unusual. Sometimes,
however, no compression method gives a satisfactory result. In such
a case, one can always encode the data with the ASCII85 filter alone,
reducing the size of a hexadecimal bitmap by approximately 35%.

Below we give a brief description of CEP and COP. So far, only the MS DOS
version of the PS-compressors is available. In this version the
GAWK-EMX.EXE implementation of AWK and the GS386.EXE Ghostscript
interpreter are used. We tested the package using several Ghostscript and
GAWK implementations; at present we use Ghostscript 5.10 and GAWK 3.0.3.

==========================================================================
                       C E P   AND   U N C E P
==========================================================================

The CEP subpackage consists of the MS DOS batch files CEP.BAT and
UNCEP.BAT and the AWK programs CEP.AWK and UNCEP.AWK.
First, AWK inspects the source EPS file, doing its best to recognize the
position of the hexadecimal bitmap; next it creates an appropriate
PostScript program; then control is passed to Ghostscript, which simply
executes the submitted program: it encodes the bitmap and copies the
remaining lines verbatim. The original preamble is slightly modified;
nevertheless, all DSC comments are left intact. If the bitmap cannot be
found or AWK suspects that trouble may arise, the CEP engine gives up.

The resulting file should be verified prior to removing the original one,
as the CEP heuristic tricks may fail to locate the bitmap properly;
moreover, because of GS bugs, removing the source prematurely may also be
painful.

CEP never generates binary output -- only hexadecimal and ASCII85
encodings are supported. This is due to the fact that CEP-compressed EPS
files are primarily meant to be used in the context of TeX+DVIPS.
Nevertheless, the resulting files can be used in other typesetting
systems as so-called placeable EPSes. The applicability to non-TeX
applications, however, is somewhat limited, as binary TIFF previews may
be misinterpreted by (G)AWK.

UNCEP requires that a CEP-compressed file has not been changed. In
particular, it relies on the information in a quasi-DSC comment
`%UNCEPInfo:'. This information can be destroyed by a seemingly innocent
modification (e.g., by adding or removing a comment line).

Note that the technique employed by CEP destroys, by its nature, the
information about the line-breaking structure of the hexadecimal bitmap.
Therefore, UNCEP cannot restore the original file exactly. The
line-breaking structure is of no concern to a PostScript interpreter.
There exist programs, however, reading their own bitmap EPS files, which
for unknown reasons make use of such (sub)lexical information; Aldus
PhotoStyler is a notable example.
CEP USAGE:

   cep.bat input_file output_file [options]

   the program recognizes the following options:
     8      -- use ASCII85 coding (default)
     h or H -- use HEX (hexadecimal) coding
     r or R -- use RLE (RunLength) compression (default)
     l or L -- use LZW compression
     f or F -- use Flate compression (non-standard!)
     n or N -- don't compress

   NOTE: names of input_file and output_file must differ.

UNCEP USAGE:

   uncep.bat input_file output_file

   NOTE: names of input_file and output_file must differ;
         the decompression and decoding method is taken from the
         input file.

==========================================================================
                       C O P   AND   U N C O P
==========================================================================

The COP subpackage consists of the MS DOS batch files COP.BAT and
UNCOP.BAT, and the AWK programs COP.AWK and UNCOP.AWK.

COP reads the supplied data and encodes it appropriately. No analysis of
the PostScript data is performed; the entire file is encoded without
changing a single bit. The only aspect that is taken into account is the
DSC comment `%%BoundingBox:'; if it is found, COP inserts this comment in
the preamble, otherwise the resulting file does not contain the bounding
box information. COP-generated files are readable by any PostScript
Level 2 interpreter.

UNCOP scans the header and deduces from it the method of decompression,
hence no options are needed. UNCOP, unlike UNCEP, restores the original
file exactly. It is still recommended, however, that the user verifies
whether the resulting file is properly interpreted by GS. Because of GS
bugs, removing the source file prematurely after compression or
decompression may turn out to be painful.

Since COP can be used to compress any data for arbitrary applications,
binary encoding is also allowed. The resulting files can be used in
typesetting systems that accept so-called placeable EPSes. Unfortunately,
binary TIFF previews make files illegible for PostScript after
compression.
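Whether a file carries such a binary preview can be checked before
compression. The Python sketch below is not part of the package; it
assumes the usual DOS EPS binary header layout (magic bytes C5 D0 D3 C6
followed by little-endian offsets and lengths of the PostScript, metafile
and TIFF sections), and the function names are hypothetical.

```python
import struct

# Magic number that opens a DOS EPS file with a binary preview
# (assumption: standard 30-byte DOS EPS binary header).
DOS_EPS_MAGIC = b"\xC5\xD0\xD3\xC6"


def describe_eps_header(header):
    """Return the section table of a DOS binary EPS header, or None.

    `header` must hold at least the first 30 bytes of the file.
    """
    if len(header) < 30 or header[:4] != DOS_EPS_MAGIC:
        return None  # plain ASCII EPS (or something else entirely)
    # Six little-endian 32-bit values follow the magic number.
    fields = struct.unpack("<6I", header[4:28])
    names = ("ps_offset", "ps_length",
             "metafile_offset", "metafile_length",
             "tiff_offset", "tiff_length")
    return dict(zip(names, fields))


def has_binary_preview(path):
    """True if the file at `path` starts with the DOS EPS binary header."""
    with open(path, "rb") as f:
        return describe_eps_header(f.read(30)) is not None
```

A file for which `has_binary_preview' returns True would have to be
stripped of its preview first; how to do that is outside the scope of
this sketch.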
COP USAGE:

   cop.bat input_file output_file [options]

   the program recognizes the following options:
     8      -- use ASCII85 coding (default)
     b or B -- use binary coding
     h or H -- use HEX (hexadecimal) coding
     r or R -- use RLE (RunLength) compression (default)
     l or L -- use LZW compression
     f or F -- use Flate compression (non-standard!)
     n or N -- don't compress

   NOTE: names of input_file and output_file must differ;
         observe that binary encoding is, in fact, no encoding at all.

UNCOP USAGE:

   uncop.bat input_file output_file

   NOTE: names of input_file and output_file must differ;
         the decompression and decoding method is taken from the
         input file.

=============================================================================
          A HEAP OF REMARKS CONCERNING   C E P   AND   C O P
=============================================================================

The applied solution addresses several problems:

  * It is not at all obvious how to determine syntactically where
    a hexadecimal bitmap begins in an EPS file; semantic analysis (by
    redefining the PostScript primitives image, imagemask and colorimage)
    is possible, but it also has its limitations; anyway, we decided to
    recognize a bitmap syntactically, which raised the problem of
    recognizing such artefacts as `add' or `def', which look like
    fragments of a bitmap but in fact are not.

  * Also, it is not obvious which compression method should be applied to
    given data; usually, ASCII85 encoding is advisable; for pure bitmaps
    (CEP) RLE compression is satisfactory, although the LZW and Flate
    filters usually produce much better results (the latter seems to be
    the best); nevertheless, both LZW and Flate encodings have limited
    usability:

    (a) LZW encoding is not implemented in GS ver. 4.x because of US
        patent law; Aladdin implemented an LZW-compatible filter instead,
        which produces non-compressed data (in fact, enlarged by some
        10%) readable by any LZWDecode filter.
        You can use an old GS version, or compile a GS version containing
        the real LZW filter at your own risk, but...

    (b) Flate encoding (the same that is used in GZIP) is not available
        (yet?) on PostScript phototypesetters -- in the Ghostscript
        documentation one can find a moderately encouraging passage:

          ``Ghostscript also supports the as yet undocumented
          FlateEncode and FlateDecode filters from PDF 1.2 and
          (presumably) PostScript Level 3''

    As a rule of thumb we would suggest not using any compression but
    ASCII85 for detailed colour photo images. This is simply a weakness
    of all lossless techniques -- the algorithms employed by ARJ, ZIP,
    LHARC, and others would also yield poor results. A reasonable
    alternative for data of this kind would be DCT (JPEG) compression.

  * As was mentioned above, ASCII85 encoding can usually be recommended;
    it caused some trouble, however. First, because of GS bugs, we
    decided to add the (dummy) NullEncode filter, which seems to cure the
    problem. But there is one more problem: ASCII85 encoded bitmaps may
    contain lines looking like DSC comments, i.e., they may begin with
    a double percent sign, %%, or with a percent sign followed by an
    exclamation mark, %! -- why didn't Adobe exclude the percent sign
    from ASCII85? Some programs may try to interpret such maybe-DSC
    lines. For example, DVIPS simply removes such lines unless the option
    -K0 is used; on the other hand, leaving DSC comments intact may
    stupefy document managers.

  * It would be convenient to have some more filters implemented, in
    particular DCT and CCITTFax; both of them, however, make use of some
    additional input data, which makes using them more complex; moreover,
    it is not clear whether one can find the optimal compression
    parameters for DCT without a WYSIWYG program; we are considering
    a one-to-one conversion between JPEG files and EPS files making use
    of DCT filters; a similar conversion between GIF files and EPS files
    making use of LZW filters can perhaps be implemented as well.
  * The package takes care of the working disk space -- no large
    temporary files are created; roughly, the needed disk space is equal
    to the size of the source + the size of the target.

  * In order to check whether a given phototypesetter is a genuine
    PostScript Level 2 interpreter, a trial-and-error method is
    necessary, since many commercial PostScript devices only claim to be
    Level 2 compatible. The following file may be helpful for verifying
    the claims of the producer of a PostScript device:

      %!PS-Adobe-2.0 EPSF-1.2
      %%Pages: 1
      %%BoundingBox: 0 0 540 150
      %%EndComments
      /Helvetica 8 selectfont 90 rotate 1 2 moveto
      % show the name of every resource in the /Filter category
      (*) {0 -10 rmoveto gsave show grestore} 255 string
      /Filter resourceforall
      showpage
      %%EOF

    Running this program yields the list of filters for a given device.
    An error reported during the processing of this file proves that the
    device is not Level 2 compatible. In such a case, the CEP package
    should not be used.

  * bugs and traps:

    (a) Apparently, prepending `flushfile' to `closefile' neutralizes an
        error in GS 3.x (tail of output swallowed).

    (b) Adding a (dummy) NullEncode filter neutralizes (probably) another
        GS bug: the ASCII85Encode filter with a target procedure may
        produce superfluous EOD marks, i.e., ~> (if things go really bad
        you can obtain thousands of them). Using the target procedure
        instead of a file object excludes GS ver. < 3.x, because early
        Ghostscripts didn't support all features of PostScript Level 2.
        Nevertheless, GS ver. >= 2.6 can be used for compression with
        hexadecimal encoding (it has the ``legal'' LZW compression).

    (c) The target procedure mentioned in (b), in turn, is due to the
        special treatment of ASCII85 encoded lines looking like DSC
        comments; this special treatment consists in breaking such lines
        after the first percent character.
        It is dedicated to the DVIPS driver, which has a dangerous option
        `remove comments' (-K1).

    (d) The artificial form of quitting, `{2 2 .quit}' instead of
        `{2 .quit}', is due to an infinite loop in GS 3.5x caused by the
        latter form. The GS internal operation `.quit' was chosen to
        provide error handling at the level of the operating system.

    (e) Still, there exist bugs in older Ghostscripts that we were not
        able to neutralize; e.g., some EPS files are properly compressed
        by GS 2.6, but GS 2.6 breaks while displaying them; GS 3.51
        behaves similarly with other bitmaps. So far, GS 4.x seems to be
        the most resistant to the ``filter trial,'' but it also reveals
        some deficiencies.

    (f) Summing up, we would strongly recommend using GS 4.x or 5.x
        (possibly with LZWEncode compiled in) and GAWK 3.x: GS 4.x is
        a nearly complete implementation of Level 2 PostScript; GAWK 3.x
        provides regular expressions for record separators, which makes
        it possible to handle end-of-lines in exactly the same manner as
        PostScript does and, moreover, it is more reliable than earlier
        versions.
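The treatment of DSC look-alike lines described in the remarks above can
be illustrated with a short Python sketch. This is not the code used by
CEP or COP, which do the job inside a PostScript target procedure; the
function names `break_dsc_lookalikes' and `encode_for_dvips' and the
64-column line width are assumptions made for illustration only. The
sketch relies on the fact that ASCII85 decoders skip white space, so
extra line breaks do not alter the decoded data.

```python
import base64


def break_dsc_lookalikes(lines):
    """Break lines starting with '%%' or '%!' after their first '%'."""
    out = []
    for line in lines:
        # Repeat, because the remainder may again look like a DSC comment
        # (e.g. '%%%abc' becomes '%', '%', '%abc').
        while line.startswith(("%%", "%!")):
            out.append("%")
            line = line[1:]
        out.append(line)
    return out


def encode_for_dvips(data, width=64):
    """ASCII85-encode `data` and defuse lines resembling DSC comments.

    Note: no '<~' prefix or '~>' EOD mark is added here.
    """
    text = base64.a85encode(data, wrapcol=width).decode("ascii")
    return "\n".join(break_dsc_lookalikes(text.split("\n")))


if __name__ == "__main__":
    sample = bytes(range(256)) * 8
    encoded = encode_for_dvips(sample)
    # Decoding ignores the inserted line breaks.
    assert base64.a85decode(encoded.encode("ascii")) == sample
    print(encoded.split("\n")[0])
```

After this treatment no line begins with %% or %!, so a program that
strips or interprets DSC comments has nothing to latch on to.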
==========================================================================
                            H I S T O R Y
==========================================================================

CEP+UNCEP:
  0.10 -- 16.03.97 -- first version
  0.20 -- 05.04.97 -- some obvious bugs removed
  0.30 -- 11.04.97 -- new method of prolog modification (processing
                      complex prologs is enabled), and merging output
                      file in PostScript (faster and less disk space
                      needed)
  0.35 -- 13.04.97 -- comments added (bilingual version)
  0.40 -- 14.04.97 -- significant improvement of performance
  0.50 -- 15.04.97 -- strings allocated statically, temporary files not
                      created (speed improved and demand for disk space
                      slashed)
  0.60 -- 19.04.97 -- PostScript error handling added, some GS bugs
                      neutralized
  0.65 -- 20.04.97 -- exit code added, frame documentation provided
  0.70 -- 21.04.97 -- UNCEP added
  0.75 -- 24.04.97 -- problems of end-of-data and end-of-lines fixed;
                      documentation collected
  1.00 -- 02.05.97 -- public domain release (BachoTeX '97)
  1.03 -- 07.01.98 -- documentation touched, package (CEP.AWK) more robust

COP+UNCOP:
  0.10 -- 06.04.97 -- first version
  0.20 -- 12.04.97 -- program structure unified with CEP, "cvx exec" used
                      in place of "run"
  0.25 -- 13.04.97 -- comments added (bilingual version)
  0.30 -- 15.04.97 -- strings allocated statically (speed improved)
  0.40 -- 19.04.97 -- PostScript error handling added, some GS bugs
                      neutralized
  0.45 -- 20.04.97 -- exit code added, frame documentation provided
  0.50 -- 24.04.97 -- problems of end-of-lines fixed; documentation
                      collected
  1.00 -- 02.05.97 -- public domain release (BachoTeX '97)
  1.03 -- 07.01.98 -- documentation touched, package (CEP.AWK) more robust

==========================================================================
                         V O C A B U L A R Y
==========================================================================

Ghostscript, GS -- a magnificent interpreter of the PostScript language
   by Aladdin Enterprises, available as a free public license product;
   its current version
   (4.03) turns out to be much more reliable than not a few commercial
   interpreters.

AWK -- a utility and a programming language for convenient and efficient
   batch data-reformatting; written in 1977 by Alfred V. Aho, Peter
   J. Weinberger, and Brian W. Kernighan.

GAWK -- GNU AWK, the GNU Free Software Foundation implementation of AWK,
   written in 1986 by Paul Rubin and Jay Fenlason, with advice from
   Richard Stallman.

GNU -- The Free Software Foundation (FSF) is a non-profit organization
   dedicated to the production and distribution of freely distributable
   software, founded by Richard M. Stallman.

TeX -- public domain typesetting system by Donald E. Knuth of Stanford
   University

DVIPS -- TeX-to-PostScript driver by Tomas Rokicki of Stanford University

DSC -- Document Structuring Convention -- Adobe's standard for
   structuring PostScript documents.

ASCII85 -- PostScript algorithm of coding binary data as 7-bit ASCII text
   consisting of only printable characters; encodes every four bytes as
   five characters from `!' to `u'; additionally, `z' is used to code
   four zeros (see PostScript Language Reference Manual, second edition,
   pp. 128--130)

RLE -- run length encoding -- a standard method of data compression (see
   PostScript Language Reference Manual, second edition, pp. 133--134)

LZW -- an algorithm of data compression by J. Ziv, A. Lempel (1978),
   improved by T. Welch (1984); Unisys, at the time Welch's employer, was
   granted a US patent in 1985 on Welch's algorithm; a grandfather clause
   was established by Unisys to make pre-1995 implementations of LZW code
   free of royalty requirements, thereby eliminating such claims on UNIX
   compress (information after Nelson H. F.
   Beebe, e-mail beebe@math.utah.edu)

DCT -- discrete cosine transform compression, an elaborate, very
   efficient but lossy compression scheme

JPEG -- Joint Photographic Experts Group, an organization responsible for
   developing an international standard for compression of image data;
   the PostScript (Level 2) DCTEncode filter conforms to the
   JPEG-proposed standard.

GZIP -- compressing tool by the GNU Free Software Foundation, based on
   a superior and unpatented compression algorithm, developed in order to
   get rid of the patented LZW algorithm.

=======================================================================
END OF `CEPCOP_E.INF'