History of Lossless Data Compression Algorithms
There are two major categories of compression algorithms: lossy and lossless. Lossy compression algorithms involve the reduction of a file’s size usually by removing small details that require a large amount of data to store at full fidelity. In lossy compression, it is impossible to restore the original file due to the removal of essential data. Lossy compression is most commonly used to store image and audio data, and while it can achieve very high compression ratios through data removal, it is not covered in this article. Lossless data compression is the size reduction of a file, such that a decompression function can restore the original file exactly with no loss of data. Lossless data compression is used ubiquitously in computing, from saving space on your personal computer to sending data over the web, communicating over a secure shell, or viewing a PNG or GIF image.
The basic principle that lossless compression algorithms work on is that any non-random file will contain duplicated information that can be condensed using statistical modeling techniques that determine the probability of a character or phrase appearing. These statistical models can then be used to generate codes for specific characters or phrases based on their probability of occurring, and assigning the shortest codes to the most common data. Such techniques include entropy encoding, run-length encoding, and compression using a dictionary. Using these techniques and others, an 8-bit character or a string of such characters could be represented with just a few bits resulting in a large amount of redundant data being removed.
Data compression has only played a significant role in computing since the 1970s, when the Internet was becoming more popular and the Lempel-Ziv algorithms were invented, but it has a much longer history outside of computing. Morse code, invented in 1838, is the earliest instance of data compression in that the most common letters in the English language such as “e” and “t” are given shorter Morse codes. Later, as mainframe computers were starting to take hold in 1949, Claude Shannon and Robert Fano invented Shannon-Fano coding. Their algorithm assigns codes to symbols in a given block of data based on the probability of the symbol occuring. The probability is of a symbol occuring is inversely proportional to the length of the code, resulting in a shorter way to represent the data .
Two years later, David Huffman was studying information theory at MIT and had a class with Robert Fano. Fano gave the class the choice of writing a term paper or taking a final exam. Huffman chose the term paper, which was to be on finding the most efficient method of binary coding. After working for months and failing to come up with anything, Huffman was about to throw away all his work and start studying for the final exam in lieu of the paper. It was at that point that he had an epiphany, figuring out a very similar yet more efficient technique to Shannon-Fano coding. The key difference between Shannon-Fano coding and Huffman coding is that in the former the probability tree is built bottom-up, creating a suboptimal result, and in the latter it is built top-down .
Early implementations of Shannon-Fano and Huffman coding were done using hardware and hardcoded codes. It was not until the 1970s and the advent of the Internet and online storage that software compression was implemented that Huffman codes were dynamically generated based on the input data . Later, in 1977, Abraham Lempel and Jacob Ziv published their groundbreaking LZ77 algorithm, the first algorithm to use a dictionary to compress data. More specifically, LZ77 used a dynamic dictionary oftentimes called a sliding window . In 1978, the same duo published their LZ78 algorithm which also uses a dictionary; unlike LZ77, this algorithm parses the input data and generates a static dictionary rather than generating it dynamically 
Both the LZ77 and LZ78 algorithms grew rapidly in popularity, spawning many variants shown in the diagram to the right. Most of these algorithms have died off since their invention, with just a handful seeing widespread use today including DEFLATE, LZMA, and LZX. Most of the commonly used algorithms are derived from the LZ77 algorithm. This is not due to technical superiority, but because LZ78 algorithms became patent-encumbered after Unisys patented the derivative LZW algorithm in 1984 and began suing software vendors, server admins, and even end users for using the GIF format without a license .
At the time, the UNIX compress utility used a very slight modification of the LZW algorithm called LZC and was later discontinued due to patent issues. Other UNIX developers also began to deviate from using the LZW algorithm in favor or open source ones. This led the UNIX community to adopt the DEFLATE-based gzip and the Burrows-Wheeler Transform-based bzip2 formats. In the long run, this was a benefit for the UNIX community because both the gzip and bzip2 formats nearly always achieve significantly higher compression ratios than the LZW format . The patent issues surrounding LZW have since subsided, as the patent on the LZW algorithm expired in 2003 . Despite this, the LZW algorithm has largely been replaced and is only commonly used in GIF compression. There have also been some LZW derivatives since then but they do not enjoy widespread use either and LZ77 algorithms remain dominant.
Another legal battle erupted in 1993 regarding the LZS algorithm. LZS was developed by Stac Electronics for use in disk compression software such as Stacker. Microsoft used the LZS algorithm in developing disk compression software that was released with MS-DOS 6.0 that purported to double the capacity of a hard drive. When Stac Electronics found out its intellectual property was being used, it filed suit against Microsoft. Microsoft was later found guilty of patent infringement and ordered to pay Stac Electronics $120 million in damages minus $13.6 million awarded in a countersuit finding that Microsoft’s infringement was not willful . Although Stac Electronics v. Microsoft had a large judgment, it did not impede the development of Lempel-Ziv algorithms as the LZW patent dispute did. The only consequence seems to be that LZS has not been forked into any new algorithms.
The Rise of DEFLATE
Corporations and other large entities have used data compression since the Lempel-Ziv algorithms were published as they have ever-increasing storage needs and data compression allows them to meet those needs. However, data compression did not see widespread use until the Internet began to take off toward the late 1980s when a need for data compression emerged. Bandwidth was either limited, expensive, or both, and data compression helped to alleviate these bottlenecks. Compression became especially important when the World Wide Web was developed as people began to share more images and other formats which are considerably larger than text. To meet the demand, several new file formats were developed incorporating compression including ZIP, GIF, and PNG.
Thom Henderson released the first commercially successful archive format called ARC in 1985 through his company, System Enhancement Associates. ARC was especially popular in the BBS community, since it was one of the first programs capable of both bundling and compressing files and it was also made open source. The ARC format uses a modification to the LZW algorithm to compress data. A man named Phil Katz noticed ARC's popularity and decided to improve it by writing the compression and decompression routines in assembly language. He released his PKARC program as shareware in 1987 and was later sued by Henderson for copyright infringement. He was found guilty and forced to pay royalties and other penalties as part of a cross-licensing agreement. He was found guilty because PKARC was a blatant copy of ARC; in some instances even the typos in the comments were identical .
Phil Katz could no longer sell PKARC after 1988 due to the cross-licensing agreement, so in 1989 he created a tweaked version of PKARC that is now known as the ZIP format. As a result of its use of LZW, it was considered patent encumbered and Katz later chose to switch to the new IMPLODE algorithm. The format was again updated in 1993, when Katz released PKZIP 2.0, which implemented the DEFLATE algorithm as well as other features like split volumes. This version of the ZIP format is found ubiquitously today, as almost all .zip files follow the PKZIP 2.0 format despite its great age.
The GIF, or Graphics Interchange Format, was developed by CompuServe in 1987 to allow bitmaps to be shared without data loss (although the format is limited to 256 colors per frame), while substantially reducing the file size to allow transmission over the Internet. However, like the ZIP format, GIF is also based on the LZW algorithm. Despite being patent encumbered, Unisys was unable to enforce their patents adequately enough to stop the format from spreading. Even now, over 20 years later, the GIF remains in use especially for its capability of being animated.
Although GIF could not be stopped, CompuServe sought a format unencumbered by patents and in 1994 introduced the Portable Network Graphics (PNG) format. Like ZIP, the PNG standard uses the DEFLATE algorithm to perform compression. Although DEFLATE was patented by Katz  the patent was never enforced and thus PNG and other DEFLATE-based formats avoid infringing on patents. Although LZW took off in the early days of compression, due to Unisys's litigious nature it has more or less died off in favor of the faster and more efficient DEFLATE algorithm. DEFLATE is currently the most used data compression algorithm since it is a bit like the Swiss Army knife of compression.
Beyond its use in the PNG and ZIP formats, DEFLATE is also used very frequently elsewhere in computing. For example, the gzip (.gz) file format uses DEFLATE since it is essentially an open source version of ZIP. Other uses of DEFLATE include HTTP, SSL, and other technologies designed for efficient data compression over a network.
Sadly, Phil Katz did not live long enough to see his DEFLATE algorithm take over the computing world. He suffered from alcoholism for several years and his life began to fall apart in the late 1990s, having been arrested several times for drunk driving and other violations. Katz was found dead in a hotel room on April 14, 2000, at the age of 37. The cause of death was found to be acute pancreatic bleeding from his alcoholism, brought on by the many empty bottles of liquor found near his body .
Current Archival Software
The ZIP format and other DEFLATE-based formats were king up until the mid 1990s when new and improved formats began to emerge. In 1993, Eugene Roshal released his archiver known as WinRAR which utilizes the proprietary RAR format. The latest version of RAR uses a combination of the PPM and LZSS algorithms, but not much is known about earlier implementations. RAR has become a standard format for sharing files over the Internet, specifically in the distribution of pirated media. An open-source implementation of the Burrows-Wheeler Transform called bzip2 was introduced in 1996 and rapidly grew in popularity on the UNIX platform against the DEFLATE-based gzip format. Another open-source compression program was released in 1999 as the 7-Zip or .7z format. 7-Zip could be the first format to challenge the dominance of ZIP and RAR due to its generally high compression ratio and the format's modularity and openness. This format is not limited to using one compression algorithm, but can instead choose between bzip2, LZMA, LZMA2, and PPMd algorithms among others. Finally, on the bleeding edge of archival software are the PAQ* formats. The first PAQ format was released by Matt Mahoney in 2002, called PAQ1. PAQ substantially improves on the PPM algorithm by using a technique known as context mixing which combines two or more statistical models to generate a better prediction of the next symbol than either of the models on their own.
The future is never certain, but based on current trends some predictions can be made as to what may happen in the future of data compression. Context Mixing algorithms such as PAQ and its variants have started to attract popularity, and they tend to achieve the highest compression ratios but are usually slow. With the exponential increase in hardware speed following Moore's Law, context mixing algorithms will likely flourish as the speed penalties are overcome through faster hardware due to their high compression ratio. The algorithm that PAQ sought to improve, called Prediction by Partial Matching (PPM) may also see new variants. Finally, the Lempel-Ziv Markov chain Algorithm (LZMA) has consistently shown itself to have an excellent compromise between speed and high compression ratio and will likely spawn more variants in the future. LZMA may even be the "winner" as it is further developed, having already been adopted in numerous competing compression formats since it was introduced with the 7-Zip format. Another potential development is the use of compression via substring enumeration (CSE) which is an up-and-coming compression technique that has not seen many software implementations yet. In its naive form it performs similarly to bzip2 and PPM, and researchers have been working to improve its efficiency .
Many different techniques are used to compress data. Most compression techniques cannot stand on their own, but must be combined together to form a compression algorithm. Those that can stand alone are often more effective when joined together with other compression techniques. Most of these techniques fall under the category of entropy coders, but there are others such as Run-Length Encoding and the Burrows-Wheeler Transform that are also commonly used.
Run-Length Encoding is a very simple compression technique that replaces runs of two or more of the same character with a number which represents the length of the run, followed by the original character; single characters are coded as runs of 1. RLE is useful for highly-redundant data, indexed images with many pixels of the same color in a row, or in combination with other compression techniques like the Burrows-Wheeler Transform.
Here is a quick example of RLE:
The Burrows-Wheeler Transform is a compression technique that aims to reversibly transform a block of input data such that the amount of runs of identical characters is maximized. The BWT itself does not perform any compression operations, it simply transforms the input such that it can be more efficiently coded by a Run-Length Encoder or other secondary compression technique.
The algorithm for a BWT is simple:
- Create a string array.
- Generate all possible rotations of the input string, storing each in the array.
- Sort the array alphabetically.
- Return the last column of the array (the last character of each string in the array concatenated) .
BWT usually works best on long inputs with many alternating identical characters. Here is an example of the algorithm being run on an ideal input. Note that & is an End of File character:
|| Alpha-Sorted Rotations
Because of its alternating identical characters, performing the BWT on this input generates an optimal result that another algorithm could further compress, such as RLE which would yield "3H&3A". While this example generated an optimal result, it does not generate optimal results on real-world data.
Entropy in data compression means the smallest number of bits needed, on average, to represent a symbol or literal. A basic entropy coder combines a statistical model and a coder. The input file is parsed and used to generate a statistical model that consists of the probabilities of a given symbol appearing. Then, the coder will use the statistical model to determine what bit-or-bytecodes to assign to each symbol such that the most common symbols have the shortest codes and the least common symbols have the longest codes .
This is one of the earliest compression techniques, invented in 1949 by Claude Shannon and Robert Fano. This technique involves generating a binary tree to represent the probabilities of each symbol occurring. The symbols are ordered such that the most frequent symbols appear at the top of the tree and the least likely symbols appear at the bottom.
The code for a given symbol is obtained by searching for it in the Shannon-Fano tree, and appending to the code a value of 0 or 1 for each left or right branch taken, respectively. For example, if “A” is two branches to the left and one to the right its code would be “001”. Shannon-Fano coding does not always produce optimal codes due to the way it builds the binary tree from the bottom up. For this reason, Huffman coding is used instead as it generates an optimal code for any given input.
The algorithm to generate Shannon-Fano codes is fairly simple
- Parse the input, counting the occurrence of each symbol.
- Determine the probability of each symbol using the symbol count.
- Sort the symbols by probability, with the most probable first.
- Generate leaf nodes for each symbol.
- Divide the list in two while keeping the probability of the left branch roughly equal to those on the right branch.
- Prepend 0 and 1 to the left and right nodes' codes, respectively.
- Recursively apply steps 5 and 6 to the left and right subtrees until each node is a leaf in the tree. 
Huffman Coding is another variant of entropy coding that works in a very similar manner to Shannon-Fano Coding, but the binary tree is built from the top down to generate an optimal result.
The algorithm to generate Huffman codes shares its first steps with Shannon-Fano:
- Parse the input, counting the occurrence of each symbol.
- Determine the probability of each symbol using the symbol count.
- Sort the symbols by probability, with the most probable first.
- Generate leaf nodes for each symbol, including P, and add them to a queue.
- While (Nodes in Queue > 1)
- Remove the two lowest probability nodes from the queue.
- Prepend 0 and 1 to the left and right nodes' codes, respectively.
- Create a new node with value equal to the sum of the nodes’ probability.
- Assign the first node to the left branch and the second node to the right branch.
- Add the node to the queue
6. The last node remaining in the queue is the root of the Huffman tree. 
Arithmetic coding is arguably the most optimal entropy coding technique if the objective is the best compression ratio since it usually achieves better results than Huffman Coding. It is, however, quite complicated compared to the other coding techniques.
Rather than splitting the probabilities of symbols into a tree, arithmetic coding transforms the input data into a single rational number between 0 and 1 by changing the base and assigning a single value to each unique symbol from 0 up to the base. Then, it is further transformed into a fixed-point binary number which is the encoded result. The value can be decoded into the original output by changing the base from binary back to the original base and replacing the values with the symbols they correspond to.
A general algorithm to compute the arithmetic code is:
- Calculate the number of unique symbols in the input. This number represents the base b (e.g. base 2 is binary) of the arithmetic code.
- Assign values from 0 to b to each unique symbol in the order they appear.
- Using the values from step 2, replace the symbols in the input with their codes
- Convert the result from step 3 from base b to a sufficiently long fixed-point binary number to preserve precision.
- Record the length of the input string somewhere in the result as it is needed for decoding .
Here is an example of an encode operation, given the input “ABCDAABD”:
- Found 4 unique symbols in input, therefore base = 4. Length = 8
- Assigned values to symbols: A=0, B=1, C=2, D=3
- Replaced input with codes: “0.012300134” where the leading 0 is not a symbol.
- Convert “0.012311234” from base 4 to base 2: “0.011011000001112”
- Result found. Note in result that input length is 8.
Assuming 7-bit ASCII codes, the input is 56 bits long, while its arithmetic coding is just 14 bits long resulting in a very high compression ratio of 25%. This example demonstrates how arithmetic coding compresses well when given a limited character set.
Sliding Window Algorithms
Published in 1977, LZ77 is the algorithm that started it all. It introduced the concept of a 'sliding window' for the first time which brought about significant improvements in compression ratio over more primitive algorithms. While parsing the file, the LZ77 algorithm adds frequently used character combinations to a dictionary. When it finds the same combination again, it appends the next character found to the entry and creates a new dictionary entry. The dictionary used changes dynamically based on the sliding window as the file is parsed. For example, the sliding window could be 64MB which means that the dictionary will contain entries for the past 64MB of the input data. Compression of the data happens by replacing symbols and phrases wiith what is known as an offset-length pair, combined with the first symbol after the string. The offset is how far from the start of the input the desired string is, the length is how many characters to read, and the
Given an input "abbadabba" the output would look something like "abb(0,1,'d')(0,3,'a')" as in the example below:
|| (0, 1, 'd')|
||(0, 3, 'a')|
While this substitution appears slightly larger than the input, it generally achieves a smaller result given longer input data .
LZR is a modification of LZ77 invented by Michael Rodeh in 1981. The algorithm aims to be a linear time alternative to LZ77. However, the encoded pointers can point to any offset in the file which means LZR consumes a considerable amount of memory. Combined with its poor compression ratio (LZ77 is often superior) it is an unfeasible variant .
DEFLATE was invented by Phil Katz in 1993 and is the basis for the majority of compression tasks today. It simply combines an LZ77 or LZSS preprocessor with Huffman coding on the backend to achieve moderately compressed results in a short time.
DEFLATE64 is a proprietary extension of the DEFLATE algorithm which increases the dictionary size to 64kB (hence the name) and allows greater distance in the sliding window. It increases both performance and compression ratio compared to DEFLATE . However, the proprietary nature of DEFLATE64 and its modest improvements over DEFLATE has led to limited adoption of the format. Open source algorithms such as LZMA are generally used instead.
The LZSS, or Lempel-Ziv-Storer-Szymanski algorithm was first published in 1982 by James Storer and Thomas Szymanski. LZSS improves on LZ77 in that it can detect whether a substitution will decrease the filesize or not. If no size reduction will be achieved, the input is left as a literal in the output. Otherwise, the section of the input is replaced with an (offset, length) pair where the offset is how many bytes from the start of the input and the length is how many characters to read from that position . Another improvement over LZ77 comes from the elimination of the "next character" and uses just an offset-length pair.
Here is a brief example given the input " these theses" which yields " these(0,6)s" which saves just one byte, but saves considerably more on larger inputs.
LZSS is still used in many popular archive formats, the best known of which is RAR. It is also sometimes used for network data compression.
LZH was developed in 1987 and it stands for "Lempel-Ziv Huffman." It is a variant of LZSS that utilizes Huffman coding to compress the pointers, resulting in slightly better compression. However, the improvements gained using Huffman coding are negligible and the compression is not worth the performance hit of using Huffman codes .
LZB was also developed in 1987 by Timothy Bell et al as a variant of LZSS. Like LZH, LZB also aims to reduce the compressed file size by encoding the LZSS pointers more efficiently. The way it does this is by gradually increasing the size of the pointers as the sliding window grows larger. It can achieve higher compression than LZSS and LZH, but it is still rather slow compared to LZSS due to the extra encoding step for the pointers .
ROLZ stands for "Reduced Offset Lempel-Ziv" and its goal is to improve LZ77 compression by restricting the offset length to reduce the amount of data required to encode the offset-length pair. This derivative of LZ77 was first seen in 1991 in Ross Williams' LZRW4 algorithm. Other implementations include BALZ, QUAD, and RZM. Highly optimized ROLZ can achieve nearly the same compression ratios as LZMA; however, ROLZ suffers from a lack of popularity.
LZP stands for "Lempel-Ziv + Prediction." It is a special case of ROLZ algorithm where the offset is reduced to 1. There are several variations using different techniques to achieve either faster operation of better compression ratios. LZW4 implements an arithmetic encoder to achieve the best compression ratio at the cost of speed .
Ron Williams created this algorithm in 1991, introducing the concept of a Reduced-Offset Lempel-Ziv compression for the first time. LZRW1 can achieve high compression ratios while remaining very fast and efficient. Ron Williams also created several variants that improve on LZRW1 such asa LZRW1-A, 2, 3, 3-A, and 4 .
Jeff Bonwick created his Lempel-Ziv Jeff Bonwick algorithm in 1998 for use in the Solaris Z File System (ZFS). It is considered a variant of the LZRW algorithm, specifically the LZRW1 variant which is aimed at maximum compression speed. Since it is used in a file system, speed is especially important to ensure that disk operations are not bottlenecked by the compression algorithm.
The Lempel-Ziv-Stac algorithm was developed by Stac Electronics in 1994 for use in disk compression software. It is a modification to LZ77 which distinguishes between literal symbols in the output and offset-length pairs, in addition to removing the next encountered symbol. The LZS algorithm is functionally most similar to the LZSS algorithm .
The LZX algorithm was developed in 1995 by Jonathan Forbes and Tomi Poutanen for the Amiga computer. The X in LZX has no special meaning. Forbes sold the algorithm to Microsoft in 1996 and went to work for them, where it was further improved upon for use in Microsoft's cabinet (.CAB) format. This algorithm is also employed by Microsoft to compress Compressed HTML Help (CHM) files, Windows Imaging Format (WIM) files, and Xbox Live Avatars .
LZO was developed by Markus Oberhumer in 1996 whose development goal was fast compression and decompression. It allows for adjustable compression levels and requires only 64kB of additional memory for the highest compression level, while decompression only requires the input and output buffers. LZO functions very similarly to the LZSS algorithm but is optimized for speed rather than compression ratio .
The Lempel-Ziv Markov chain Algorithm was first published in 1998 with the release of the 7-Zip archiver for use in the .7z file format. It achieves better compression than bzip2, DEFLATE, and other algorithms in most cases. LZMA uses a chain of compression techniques to achieve its output. First, a modified LZ77 algorithm, which operates at a bitwise level rather than the traditional bytewise level, is used to parse the data. Then, the output of the LZ77 algorithm undergoes arithmetic coding. More techniques can be applied depending on the specific LZMA implementation. The result is considerably improved compression ratios over most other LZ variants mainly due to the bitwise method of compression rather than bytewise .
LZMA2 is an incremental improvement to the original LZMA algorithm, first introduced in 2009 in an update to the 7-Zip archive software. LZMA2 improves the multithreading capabilities and thus performance of the LZMA algorithm, as well as better handling of incompressible data resulting in slightly better compression.
Statistical Lempel-Ziv was a concept created by Dr. Sam Kwong and Yu Fan Ho in 2001. The basic principle it operates on is that a statistical analysis of the data can be combined with an LZ77-variant algorithm to further optimize what codes are stored in the dictionary .
LZ78 was created by Lempel and Ziv in 1978, hence the abbreviation. Rather than using a sliding window to generate the dictionary, the input data is parsed entirely to generate a static, unchanging dictionary . While parsing the file, the LZ78 algorithm adds frequently each newly encountered character or string of characters. When it finds the same combination again, it appends the next character found to the entry and creates a new dictionary entry with the result.
An input such as "abbadabba" would generate the output "(0,a)(0,b)(2,a)(0,d)(1,b)(2,a)." You can see how this was derived in the following example:
| Dictionary Index
 Wolfram, Stephen. A New Kind of Science. Champaign, IL: Wolfram Media, 2002. 1069. Print.
 Ken Huffman. Profile: David A. Huffman, Scientific American, September 1991, pp. 54–58
 USPTO Patent #4814746 http://www.google.com/patents?hl=en&lr=&vid=USPAT4814746&id=Z2oWAAAAEBAJ&oi=fnd&dq=lempel+ziv+miller+wegman&printsec=abstract#v=onepage&q=lempel%20ziv%20miller%20wegman&f=false
 Shannon, C.E. (July 1948). "A Mathematical Theory of Communication". Bell System Technical Journal 27: 379–423.
 Huffman Coding
 Arithmetic Coding
 Modeling for compression
 comp.compression Frequently Asked Questions
 ZIP patentUS patent 5051745
 Katz death http://www.bbsdocumentary.com/library/CONTROVERSY/LAWSUITS/SEA/katzbio.txt
 ARC http://www.esva.net/~thom/philkatz.html
 DEFLATE64 benchmark http://www.binaryessence.com/dct/apc/en000263.htm
 compression via substring enumeration
 LZRW* http://www.ross.net/compression/
 LZX sold to MSFT http://www.linkedin.com/pub/jonathan-forbes/3/70a/a4b
 LZMA2 date http://www.7-zip.org/history.txt
 Statistical LZ
 Burrows M and Wheeler D (1994), A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation