How are zlib, gzip, and zip related? What do they have in common and what are their differences?

The compression algorithm used in zlib is almost the same as in gzip and zip. How do they differ?

Author: andreymal, 2019-05-27

1 answers

tl;dr:

Zip is an archive format that usually uses the Deflate algorithm for compression; a zip archive can contain several files that are compressed separately from each other. gzip compresses exactly one file (this one file can be a tar archive) and also uses the Deflate algorithm. The zlib library implements Deflate and is used in zip, gzip, png and many other applications.


Format zip was developed by Phil Katz as an open format, but its implementation, PKZIP, was shareware. This archive format stores files and directory structure, and each file is compressed independently of the other files. Files and directory structure can also be encrypted.

The ZIP format supports several compression methods:

0 - The file is stored (no compression)
1 - The file is Shrunk
2 - The file is Reduced with compression factor 1
3 - The file is Reduced with compression factor 2
4 - The file is Reduced with compression factor 3
5 - The file is Reduced with compression factor 4
6 - The file is Imploded
7 - Reserved for Tokenizing compression algorithm
8 - The file is Deflated
9 - Enhanced Deflating using Deflate64(tm)
10 - PKWARE Data Compression Library Imploding (old IBM TERSE)
11 - Reserved by PKWARE
12 - File is compressed using BZIP2 algorithm
13 - Reserved by PKWARE
14 - LZMA (EFS)
15 - Reserved by PKWARE
16 - Reserved by PKWARE
17 - Reserved by PKWARE
18 - File is compressed using IBM TERSE (new)
19 - IBM LZ77 z Architecture (PFS)
97 - WavPack compressed data
98 - PPMd version I, Rev 1

Methods 1 through 7 are historical and are not used. Methods 9 to 98 were added relatively recently and they are used infrequently. The only widely used method is method 8, Deflate and, to some extent, method 0, which simply stores files without compression. Most zip files you will encounter will only use methods 8 and 0.

The ISO/IEC 21320-1:2015 standard for file containers is a limited zip format used in Java files (.jar), Office Open XML (Microsoft Office .docx, .xlsx, .pptx), Office Document Format (. odt, .ods,. odp) and EPUB (.epub). This standard allows only compression methods 0 and 8, and also has other limitations such as no encryption or signatures.

Around 1990, the Info-ZIP group wrote portable and free implementations of the zip and unzip utilities with support for Deflate compression and decompression of older formats. This greatly expanded the use of the .zip format.

In the early 90'sthe gzip format was developed to replace the Unix utility compress. The compress utility compressed exactly one file and added the .Z extension to the file. It used the LZW algorithm, which was protected by patents at the time, making it difficult to use. Although some specific Deflate implementations were patented by Phil Katz, the format itself is not protected by patents, so it is possible to write a Deflate implementation without infringing on patents. The gzip utility was conceived as a transparent replacement for the compress utility and can actually decompress compressed compress om data. gzip when when compressed, it adds the extension .gz to the file name. gzip uses the Deflate compression format, which allows for slightly better compression than compress, decompresses quickly, and checks data integrity with CRC-32. The gzip format header also allows you to store the original file name and the time it was modified.

The compress utility compresses exactly one file, and to compress multiple files while preserving the attributes and directory structure, tar archives were usually created, which were then compressed using compress to get the .tar.Z file. To avoid compressing everything manually, the tar utility had and still has an option to automatically enable compression using compress; with the advent of gzip, an option was also added to enable gzip compression and create .tar.gz files. Archives .tar.gz compress better than zip, since files are not compressed individually and gzip has the ability to work more efficiently with redundancy, especially if there are many small files in the archive. .tar.gz is widely used in Unix because of its good portability, but there are also more efficient compression methods, so you may encounter archives .tar.bz2 and .tar.xz.

As opposed to .tar, .zip archives store a list of files separately at the end of the file. Given that each file is compressed independently, this allows you to read individual files in the archive without decompressing the entire archive. And the archive .tar must be unpacked and read in its entirety to even just see the list of files in the archive, and this creates difficulties in some cases situations.

Shortly after the introduction of gzip, around the mid-1990s, the same patent problems called into question the free use of the .gif image format, very widely used on bulletin boards (BBS) and on the World Wide Web (WWW - a novelty of the time). So a small group created PNG-a lossless image format-to replace GIF. PNG also uses the Deflate format for compression. To promote the widespread use of the format PNG, two free libraries have been created: libpng implements all the functions of the PNG format, and zlib provides compression and decompression code for use in libpng and for other applications. zlib was adapted from the gzip code.

All the mentioned patents have already expired.

The zlib library supports Deflate, as well as three wrappers for Deflate streams. These include: no wrapper at all ("raw" Deflate), a zlib wrapper that is used in PNG data blocks, and a gzip wrapper. The main difference between zlib and gzip wrappers is that the zlib wrapper is more compact: only 6 bytes versus at least 18 bytes for gzip, and the Adler-32 integrity check is faster than the CRC-32 used in gzip. The raw Deflate method is used by programs that read and write the .zip format, which works with Deflate data in its own way.

The zlib library is used very widely, including as an implementation gzip format; for example, it is used in most browsers and web servers to compress data transmitted in HTTP (gzip format is usually used, since the zlib format has some historical difficulties).

Different Deflate implementations can compress the same data with different efficiencies, as evidenced by the presence of selectable compression levels that allow you to choose between compression efficiency and time spent. zlib and PKZIP-do not the only implementations of Deflate. 7-Zip or zopfli can take a very long time to compress data several percent more efficiently than zlib. The pigz utility, a multithreaded implementation of gzip, can use zlib (compression levels 1-9) or zopfli (compression level 11) and somewhat reduces the time spent by compressing large files on multiple processor cores.


Slightly loose translation response from Mark Adler on enSO with some additions.

 6
Author: andreymal, 2019-05-27 10:21:31