History of Lossless Data Compression Algorithms
From GHN
Contents 
Introduction
There are two major categories of compression algorithms: lossy and lossless. Lossy compression algorithms involve the reduction of a file’s size usually by removing small details that require a large amount of data to store at full fidelity. In lossy compression, it is impossible to restore the original file due to the removal of essential data. Lossy compression is most commonly used to store image and audio data, and while it can achieve very high compression ratios through data removal, it is not covered in this article. Lossless data compression is the size reduction of a file, such that a decompression function can restore the original file exactly with no loss of data. Lossless data compression is used ubiquitously in computing, from saving space on your personal computer to sending data over the web, communicating over a secure shell, or viewing a PNG or GIF image.
The basic principle that lossless compression algorithms work on is that any nonrandom file will contain duplicated information that can be condensed using statistical modeling techniques that determine the probability of a character or phrase appearing. These statistical models can then be used to generate codes for specific characters or phrases based on their probability of occurring, and assigning the shortest codes to the most common data. Such techniques include entropy encoding, runlength encoding, and compression using a dictionary. Using these techniques and others, an 8bit character or a string of such characters could be represented with just a few bits resulting in a large amount of redundant data being removed.
History
Data compression has only played a significant role in computing since the 1970s, when the Internet was becoming more popular and the LempelZiv algorithms were invented, but it has a much longer history outside of computing. Morse code, invented in 1838, is the earliest instance of data compression in that the most common letters in the English language such as “e” and “t” are given shorter Morse codes. Later, as mainframe computers were starting to take hold in 1949, Claude Shannon and Robert Fano invented ShannonFano coding. Their algorithm assigns codes to symbols in a given block of data based on the probability of the symbol occuring. The probability is of a symbol occuring is inversely proportional to the length of the code, resulting in a shorter way to represent the data [1].
Two years later, David Huffman was studying information theory at MIT and had a class with Robert Fano. Fano gave the class the choice of writing a term paper or taking a final exam. Huffman chose the term paper, which was to be on finding the most efficient method of binary coding. After working for months and failing to come up with anything, Huffman was about to throw away all his work and start studying for the final exam in lieu of the paper. It was at that point that he had an epiphany, figuring out a very similar yet more efficient technique to ShannonFano coding. The key difference between ShannonFano coding and Huffman coding is that in the former the probability tree is built bottomup, creating a suboptimal result, and in the latter it is built topdown [2].
Early implementations of ShannonFano and Huffman coding were done using hardware and hardcoded codes. It was not until the 1970s and the advent of the Internet and online storage that software compression was implemented that Huffman codes were dynamically generated based on the input data [1]. Later, in 1977, Abraham Lempel and Jacob Ziv published their groundbreaking LZ77 algorithm, the first algorithm to use a dictionary to compress data. More specifically, LZ77 used a dynamic dictionary oftentimes called a sliding window [3]. In 1978, the same duo published their LZ78 algorithm which also uses a dictionary; unlike LZ77, this algorithm parses the input data and generates a static dictionary rather than generating it dynamically [4]
Legal Issues
Both the LZ77 and LZ78 algorithms grew rapidly in popularity, spawning many variants shown in the diagram to the right. Most of these algorithms have died off since their invention, with just a handful seeing widespread use today including DEFLATE, LZMA, and LZX. Most of the commonly used algorithms are derived from the LZ77 algorithm. This is not due to technical superiority, but because LZ78 algorithms became patentencumbered after Unisys patented the derivative LZW algorithm in 1984 and began suing software vendors, server admins, and even end users for using the GIF format without a license [5][6].
At the time, the UNIX compress utility used a very slight modification of the LZW algorithm called LZC and was later discontinued due to patent issues. Other UNIX developers also began to deviate from using the LZW algorithm in favor or open source ones. This led the UNIX community to adopt the DEFLATEbased gzip and the BurrowsWheeler Transformbased bzip2 formats. In the long run, this was a benefit for the UNIX community because both the gzip and bzip2 formats nearly always achieve significantly higher compression ratios than the LZW format [6]. The patent issues surrounding LZW have since subsided, as the patent on the LZW algorithm expired in 2003 [5]. Despite this, the LZW algorithm has largely been replaced and is only commonly used in GIF compression. There have also been some LZW derivatives since then but they do not enjoy widespread use either and LZ77 algorithms remain dominant.
Another legal battle erupted in 1993 regarding the LZS algorithm. LZS was developed by Stac Electronics for use in disk compression software such as Stacker. Microsoft used the LZS algorithm in developing disk compression software that was released with MSDOS 6.0 that purported to double the capacity of a hard drive. When Stac Electronics found out its intellectual property was being used, it filed suit against Microsoft. Microsoft was later found guilty of patent infringement and ordered to pay Stac Electronics $120 million in damages minus $13.6 million awarded in a countersuit finding that Microsoft’s infringement was not willful [7]. Although Stac Electronics v. Microsoft had a large judgment, it did not impede the development of LempelZiv algorithms as the LZW patent dispute did. The only consequence seems to be that LZS has not been forked into any new algorithms.
The Rise of DEFLATE
Corporations and other large entities have used data compression since the LempelZiv algorithms were published as they have everincreasing storage needs and data compression allows them to meet those needs. However, data compression did not see widespread use until the Internet began to take off toward the late 1980s when a need for data compression emerged. Bandwidth was either limited, expensive, or both, and data compression helped to alleviate these bottlenecks. Compression became especially important when the World Wide Web was developed as people began to share more images and other formats which are considerably larger than text. To meet the demand, several new file formats were developed incorporating compression including ZIP, GIF, and PNG.
Thom Henderson released the first commercially successful archive format called ARC in 1985 through his company, System Enhancement Associates. ARC was especially popular in the BBS community, since it was one of the first programs capable of both bundling and compressing files and it was also made open source. The ARC format uses a modification to the LZW algorithm to compress data. A man named Phil Katz noticed ARC's popularity and decided to improve it by writing the compression and decompression routines in assembly language. He released his PKARC program as shareware in 1987 and was later sued by Henderson for copyright infringement. He was found guilty and forced to pay royalties and other penalties as part of a crosslicensing agreement. He was found guilty because PKARC was a blatant copy of ARC; in some instances even the typos in the comments were identical [16].
Phil Katz could no longer sell PKARC after 1988 due to the crosslicensing agreement, so in 1989 he created a tweaked version of PKARC that is now known as the ZIP format. As a result of its use of LZW, it was considered patent encumbered and Katz later chose to switch to the new IMPLODE algorithm. The format was again updated in 1993, when Katz released PKZIP 2.0, which implemented the DEFLATE algorithm as well as other features like split volumes. This version of the ZIP format is found ubiquitously today, as almost all .zip files follow the PKZIP 2.0 format despite its great age.
The GIF, or Graphics Interchange Format, was developed by CompuServe in 1987 to allow bitmaps to be shared without data loss (although the format is limited to 256 colors per frame), while substantially reducing the file size to allow transmission over the Internet. However, like the ZIP format, GIF is also based on the LZW algorithm. Despite being patent encumbered, Unisys was unable to enforce their patents adequately enough to stop the format from spreading. Even now, over 20 years later, the GIF remains in use especially for its capability of being animated.
Although GIF could not be stopped, CompuServe sought a format unencumbered by patents and in 1994 introduced the Portable Network Graphics (PNG) format. Like ZIP, the PNG standard uses the DEFLATE algorithm to perform compression. Although DEFLATE was patented by Katz [14] the patent was never enforced and thus PNG and other DEFLATEbased formats avoid infringing on patents. Although LZW took off in the early days of compression, due to Unisys's litigious nature it has more or less died off in favor of the faster and more efficient DEFLATE algorithm. DEFLATE is currently the most used data compression algorithm since it is a bit like the Swiss Army knife of compression.
Beyond its use in the PNG and ZIP formats, DEFLATE is also used very frequently elsewhere in computing. For example, the gzip (.gz) file format uses DEFLATE since it is essentially an open source version of ZIP. Other uses of DEFLATE include HTTP, SSL, and other technologies designed for efficient data compression over a network.
Sadly, Phil Katz did not live long enough to see his DEFLATE algorithm take over the computing world. He suffered from alcoholism for several years and his life began to fall apart in the late 1990s, having been arrested several times for drunk driving and other violations. Katz was found dead in a hotel room on April 14, 2000, at the age of 37. The cause of death was found to be acute pancreatic bleeding from his alcoholism, brought on by the many empty bottles of liquor found near his body [15].
Current Archival Software
The ZIP format and other DEFLATEbased formats were king up until the mid 1990s when new and improved formats began to emerge. In 1993, Eugene Roshal released his archiver known as WinRAR which utilizes the proprietary RAR format. The latest version of RAR uses a combination of the PPM and LZSS algorithms, but not much is known about earlier implementations. RAR has become a standard format for sharing files over the Internet, specifically in the distribution of pirated media. An opensource implementation of the BurrowsWheeler Transform called bzip2 was introduced in 1996 and rapidly grew in popularity on the UNIX platform against the DEFLATEbased gzip format. Another opensource compression program was released in 1999 as the 7Zip or .7z format. 7Zip could be the first format to challenge the dominance of ZIP and RAR due to its generally high compression ratio and the format's modularity and openness. This format is not limited to using one compression algorithm, but can instead choose between bzip2, LZMA, LZMA2, and PPMd algorithms among others. Finally, on the bleeding edge of archival software are the PAQ* formats. The first PAQ format was released by Matt Mahoney in 2002, called PAQ1. PAQ is a substantial improvement on the PPM algorithm which uses a technique known as context mixing which combines two or more statistical models to generate a better prediction of the next symbol than either of the models on their own.
Future Developments
The future is never certain, but based on current trends some predictions can be made as to what may happen in the future of data compression. Context Mixing algorithms such as PAQ and its variants have started to attract popularity, and they tend to achieve the highest compression ratios but are usually slow. With the exponential increase in hardware speed following Moore's Law, context mixing algorithms will likely flourish as the speed penalties are overcome through faster hardware due to their high compression ratio. The algorithm that PAQ sought to improve, called Prediction by Partial Matching (PPM) may also see new variants. Finally, the LempelZiv Markov chain Algorithm (LZMA) has consistently shown itself to have an excellent compromise between speed and high compression ratio and will likely spawn more variants in the future. LZMA may even be the "winner" as it is further developed, having already been adopted in numerous competing compression formats since it was introduced with the 7Zip format. Another potential development is the use of compression via substring enumeration (CSE) which is an upandcoming compression technique that has not seen many software implementations yet. In its naive form it performs similarly to bzip2 and PPM, and researchers have been working to improve its efficiency [20].
Compression Techniques
Many different techniques are used to compress data. Most compression techniques cannot stand on their own, but must be combined together to form a compression algorithm. Those that can stand alone are often more effective when joined together with other compression techniques. Most of these techniques fall under the category of entropy coders, but there are others such as RunLength Encoding and the BurrowsWheeler Transform that are also commonly used.
RunLength Encoding
RunLength Encoding is a very simple compression technique that replaces runs of two or more of the same character with a number which represents the length of the run, followed by the original character; single characters are coded as runs of 1. RLE is useful for highlyredundant data, indexed images with many pixels of the same color in a row, or in combination with other compression techniques like the BurrowsWheeler Transform.
Here is a quick example of RLE:
Input: AAABBCCCCDEEEEEEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Output: 3A2B4C1D6E38A
BurrowsWheeler Transform
The BurrowsWheeler Transform is a compression technique that aims to reversibly transform a block of input data such that the amount of runs of identical characters is maximized. The BWT itself does not perform any compression operations, it simply transforms the input such that it can be more efficiently coded by a RunLength Encoder or other secondary compression technique.
The algorithm for a BWT is simple:
 Create a string array.
 Generate all possible rotations of the input string, storing each in the array.
 Sort the array alphabetically.
 Return the last column of the array (the last character of each string in the array concatenated).
BWT usually works best on long inputs with many alternating identical characters. Here is an example of the algorithm being run on an ideal input. Note that & is an End of File character:
Input 
Rotations 
AlphaSorted Rotations 
Output 
HAHAHA& 
HAHAHA& 
AHAHA&H 
HHH&AAA 
&HAHAHA 
AHA&HAH  
A&HAHAH 
A&HAHAH  
HA&HAHA 
HAHAHA&  
AHA&HAH 
HAHA&HA  
HAHA&HA 
HA&HAHA  
AHAHA&H 
&HAHAHA 
Because of its alternating identical characters, performing the BWT on this input generates an optimal result that another algorithm could further compress, such as RLE which would yield "3H&3A". While this example generated an optimal result, it does not generate optimal results on realworld data.
Entropy Encoding
Entropy in data compression means the smallest number of bits needed, on average, to represent a symbol or literal. A basic entropy coder combines a statistical model and a coder. The input file is parsed and used to generate a statistical model that consists of the probabilities of a given symbol appearing. Then, the coder will use the statistical model to determine what bitorbytecodes to assign to each symbol such that the most common symbols have the shortest codes and the least common symbols have the longest codes [8].
ShannonFano Coding
This is one of the earliest compression techniques, invented in 1949 by Claude Shannon and Robert Fano. This technique involves generating a binary tree to represent the probabilities of each symbol occurring. The symbols are ordered such that the most frequent symbols appear at the top of the tree and the least likely symbols appear at the bottom.
The code for a given symbol is obtained by searching for it in the ShannonFano tree, and appending to the code a value of 0 or 1 for each left or right branch taken, respectively. For example, if “A” is two branches to the left and one to the right its code would be “001”. ShannonFano coding does not always produce optimal codes due to the way it builds the binary tree from the bottom up. For this reason, Huffman coding is used instead as it generates an optimal code for any given input.
The algorithm to generate ShannonFano codes is fairly simple
 Parse the input, counting the occurrence of each symbol.
 Determine the probability of each symbol using the symbol count.
 Sort the symbols by probability, with the most probable first.
 Generate leaf nodes for each symbol.
 Divide the list in two while keeping the probability of the left branch roughly equal to those on the right branch.
 Prepend 0 and 1 to the left and right nodes' codes, respectively.
 Recursively apply steps 5 and 6 to the left and right subtrees until each node is a leaf in the tree. [9]
Huffman Coding
Huffman Coding is another variant of entropy coding that works in a very similar manner to ShannonFano Coding, but the binary tree is built from the top down to generate an optimal result.
The algorithm to generate Huffman codes shares its first steps with ShannonFano:
 Parse the input, counting the occurrence of each symbol.
 Determine the probability of each symbol using the symbol count.
 Sort the symbols by probability, with the most probable first.
 Generate leaf nodes for each symbol, including P, and add them to a queue.
 While (Nodes in Queue > 1)
 Remove the two lowest probability nodes from the queue.
 Prepend 0 and 1 to the left and right nodes' codes, respectively.
 Create a new node with value equal to the sum of the nodes’ probability.
 Assign the first node to the left branch and the second node to the right branch.
 Add the node to the queue
6. The last node remaining in the queue is the root of the Huffman tree. [10]
Arithmetic Coding
Arithmetic coding is arguably the most optimal entropy coding technique if the objective is the best compression ratio since it usually achieves better results than Huffman Coding. It is, however, quite complicated compared to the other coding techniques.
Rather than splitting the probabilities of symbols into a tree, arithmetic coding transforms the input data into a single rational number between 0 and 1 by changing the base and assigning a single value to each unique symbol from 0 up to the base. Then, it is further transformed into a fixedpoint binary number which is the encoded result. The value can be decoded into the original output by changing the base from binary back to the original base and replacing the values with the symbols they correspond to.
A general algorithm to compute the arithmetic code is:
 Calculate the number of unique symbols in the input. This number represents the base b (e.g. base 2 is binary) of the arithmetic code.
 Assign values from 0 to b to each unique symbol in the order they appear.
 Using the values from step 2, replace the symbols in the input with their codes
 Convert the result from step 3 from base b to a sufficiently long fixedpoint binary number to preserve precision.
 Record the length of the input string somewhere in the result as it is needed for decoding [11].
Here is an example of an encode operation, given the input “ABCDAABD”:
 Found 4 unique symbols in input, therefore base = 4. Length = 8
 Assigned values to symbols: A=0, B=1, C=2, D=3
 Replaced input with codes: “0.01230013_{4}” where the leading 0 is not a symbol.
 Convert “0.01231123_{4}” from base 4 to base 2: “0.01101100000111_{2}”
 Result found. Note in result that input length is 8.
Assuming 7bit ASCII codes, the input is 56 bits long, while its arithmetic coding is just 14 bits long resulting in a very high compression ratio of 25%. This example demonstrates how arithmetic coding compresses well when given a limited character set.
Compression Algorithms
Sliding Window Algorithms
LZ77
Published in 1977, LZ77 is the algorithm that started it all. It introduced the concept of a 'sliding window' for the first time which brought about significant improvements in compression ratio over more primitive algorithms. While parsing the file, the LZ77 algorithm adds frequently used character combinations to a dictionary. When it finds the same combination again, it appends the next character found to the entry and creates a new dictionary entry. The dictionary used changes dynamically based on the sliding window as the file is parsed. For example, the sliding window could be 64MB which means that the dictionary will contain entries for the past 64MB of the input data. Compression of the data happens by replacing symbols and phrases wiith what is known as an offsetlength pair, combined with the first symbol after the string. The offset is how far from the start of the input the desired string is, the length is how many characters to read, and the
Given an input "abbadabba" the output would look something like "abb(0,1,'d')(0,3,'a')" as in the example below:
Position 
Symbol 
Output 
0 
a 
a 
1 
b 
b 
2 
b 
b 
3 
a 
(0, 1, 'd') 
4 
d  
5 
a 
(0, 3, 'a') 
6 
b  
7 
b  
8 
a 
While this substitution appears slightly larger than the input, it generally achieves a smaller result given longer input data [3].
LZR
LZR is a modification of LZ77 invented by Michael Rodeh in 1981. The algorithm aims to be a linear time alternative to LZ77. However, the encoded pointers can point to any offset in the file which means LZR consumes a considerable amount of memory. Combined with its poor compression ratio (LZ77 is often superior) it is an unfeasible variant [12][18].
DEFLATE
DEFLATE was invented by Phil Katz in 1993 and is the basis for the majority of compression tasks today. It simply combines an LZ77 or LZSS preprocessor with Huffman coding on the backend to achieve moderately compressed results in a short time.
DEFLATE64
DEFLATE64 is a proprietary extension of the DEFLATE algorithm which increases the dictionary size to 64kB (hence the name) and allows greater distance in the sliding window. It increases both performance and compression ratio compared to DEFLATE [19]. However, the proprietary nature of DEFLATE64 and its modest improvements over DEFLATE has led to limited adoption of the format. Open source algorithms such as LZMA are generally used instead.
LZSS
The LZSS, or LempelZivStorerSzymanski algorithm was first published in 1982 by James Storer and Thomas Szymanski. LZSS improves on LZ77 in that it can detect whether a substitution will decrease the filesize or not. If no size reduction will be achieved, the input is left as a literal in the output. Otherwise, the section of the input is replaced with an (offset, length) pair where the offset is how many bytes from the start of the input and the length is how many characters to read from that position [21]. Another improvement over LZ77 comes from the elimination of the "next character" and uses just an offsetlength pair.
Here is a brief example given the input " these theses" which yields " these(0,6)s" which saves just one byte, but saves considerably more on larger inputs.
Index 
0  1 
2 
3 
4 
5 
6 
7 
8 
9 
10  11 
12 
Symbol 
t 
h 
e 
s 
e 

t 
h 
e 
s 
e 
s  
Substituted 

t 
h 
e 
s 
e 
( 
0 
, 
6 
) 
s 

LZSS is still used in many popular archive formats, the best known of which is RAR. It is also sometimes used for network data compression.
LZH
LZH was developed in 1987 and it stands for "LempelZiv Huffman." It is a variant of LZSS that utilizes Huffman coding to compress the pointers, resulting in slightly better compression. However, the improvements gained using Huffman coding are negligible and the compression is not worth the performance hit of using Huffman codes [12].
LZB
LZB was also developed in 1987 by Timothy Bell et al as a variant of LZSS. Like LZH, LZB also aims to reduce the compressed file size by encoding the LZSS pointers more efficiently. The way it does this is by gradually increasing the size of the pointers as the sliding window grows larger. It can achieve higher compression than LZSS and LZH, but it is still rather slow compared to LZSS due to the extra encoding step for the pointers [12].
ROLZ
ROLZ stands for "Reduced Offset LempelZiv" and its goal is to improve LZ77 compression by restricting the offset length to reduce the amount of data required to encode the offsetlength pair. This derivative of LZ77 was first seen in 1991 in Ross Williams' LZRW4 algorithm. Other implementations include BALZ, QUAD, and RZM. Highly optimized ROLZ can achieve nearly the same compression ratios as LZMA; however, ROLZ suffers from a lack of popularity.
LZP
LZP stands for "LempelZiv + Prediction." It is a special case of ROLZ algorithm where the offset is reduced to 1. There are several variations using different techniques to achieve either faster operation of better compression ratios. LZW4 implements an arithmetic encoder to achieve the best compression ratio at the cost of speed [17].
LZRW1
Ron Williams created this algorithm in 1991, introducing the concept of a ReducedOffset LempelZiv compression for the first time. LZRW1 can achieve high compression ratios while remaining very fast and efficient. Ron Williams also created several variants that improve on LZRW1 such asa LZRW1A, 2, 3, 3A, and 4 [22].
LZJB
Jeff Bonwick created his LempelZiv Jeff Bonwick algorithm in 1998 for use in the Solaris Z File System (ZFS). It is considered a variant of the LZRW algorithm, specifically the LZRW1 variant which is aimed at maximum compression speed. Since it is used in a file system, speed is especially important to ensure that disk operations are not bottlenecked by the compression algorithm.
LZS
The LempelZivStac algorithm was developed by Stac Electronics in 1994 for use in disk compression software. It is a modification to LZ77 which distinguishes between literal symbols in the output and offsetlength pairs, in addition to removing the next encountered symbol. The LZS algorithm is functionally most similar to the LZSS algorithm [23],
LZX
The LZX algorithm was developed in 1995 by Jonathan Forbes and Tomi Poutanen for the Amiga computer. The X in LZX has no special meaning. Forbes
LZO
blah
LZMA
blah
LZMA2
blah
Statistical LempelZiv
Dictionary Algorithms
LZ78
blah
LZW
blah
LZMW
blah
LZAP
blah
LZWL
blah
LZJ
blah
LZFG
blah
Nondictionary Algorithms
PPM
blah
bzip2
blah
PAQ
blah
Citations
[1] Wolfram, Stephen. A New Kind of Science. Champaign, IL: Wolfram Media, 2002. 1069. Print.
[2] Ken Huffman. Profile: David A. Huffman, Scientific American, September 1991, pp. 54–58
[3] LZ77
[4] LZ78
[5] USPTO Patent #4814746 http://www.google.com/patents?hl=en&lr=&vid=USPAT4814746&id=Z2oWAAAAEBAJ&oi=fnd&dq=lempel+ziv+miller+wegman&printsec=abstract#v=onepage&q=lempel%20ziv%20miller%20wegman&f=false
[5] http://www.theregister.co.uk/1999/09/01/unisys_demands_5k_licence_fee/
[6] http://stephane.lesimple.fr/wiki/blog/lzop_vs_compress_vs_gzip_vs_bzip2_vs_lzma_vs_lzma2xz_benchmark_reloaded
[7] http://www.msversus.org/archive/stac.html
[8] http://www.cs.tau.ac.il/~dcor/Graphics/advslides/entropy.pdf
[9] Shannon, C.E. (July 1948). "A Mathematical Theory of Communication". Bell System Technical Journal 27: 379–423.
[10] Huffman Coding
[11] Arithmetic Coding
[12] Modeling for compression
[13] comp.compression Frequently Asked Questions
http://www.faqs.org/faqs/compressionfaq/part1/section7.html
[14] ZIP patentUS patent 5051745
[15] Katz death http://www.bbsdocumentary.com/library/CONTROVERSY/LAWSUITS/SEA/katzbio.txt
[16] ARC http://www.esva.net/~thom/philkatz.html
[17] LZP
[18] LZR
[19] DEFLATE64 benchmark http://www.binaryessence.com/dct/apc/en000263.htm
[20] compression via substring enumeration
[21] LZSS
[22] LZRW* http://www.ross.net/compression/
[23] LZS
[24] LZX sold to MSFT http://www.linkedin.com/pub/jonathanforbes/3/70a/a4b