History of Lossless Data Compression Algorithms

From ETHW
Revision as of 03:23, 11 December 2011 by Rihamichael

Introduction

There are two major categories of compression algorithms: lossy and lossless. Lossy compression algorithms reduce a file’s size by removing small details that would require a large amount of data to store at full fidelity. Because some of the original data is discarded, a lossy-compressed file can never be restored exactly. Lossy compression is most commonly used to store image and audio data, and while it can achieve very high compression ratios, it is not covered in this article. Lossless data compression reduces a file’s size in such a way that a decompression function can restore the original file exactly, with no loss of data. Lossless compression is used ubiquitously in computing, from saving space on your personal computer to sending data over the web, communicating over a secure shell, or viewing a PNG or GIF image.


The basic principle behind lossless compression is that any non-random file contains duplicated information, which can be condensed using statistical modeling techniques that estimate the probability of a character or phrase appearing. These models can then be used to generate codes for specific characters or phrases based on their probability of occurring, assigning the shortest codes to the most common data. Such techniques include entropy encoding, run-length encoding, and dictionary compression. Using these techniques and others, an 8-bit character or a string of such characters can be represented with just a few bits, removing a large amount of redundant data.
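The dictionary idea mentioned above can be made concrete with a small sketch. The following is an illustrative Python function (not taken from any particular codec) showing how an LZ77-style compressor searches a sliding window of previously seen text for the longest match, which it can then replace with a short (distance, length) back-reference:

```python
def find_backreference(data: str, pos: int, window: int = 32) -> tuple:
    """Search back up to `window` characters for the longest string that
    matches the text starting at `pos` (the LZ77 'sliding dictionary').
    Returns (distance, length), or (0, 0) if nothing earlier matches."""
    best = (0, 0)
    start = max(0, pos - window)
    for cand in range(start, pos):
        length = 0
        # Matches may overlap the current position, as LZ77 allows.
        while (pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best[1]:
            best = (pos - cand, length)
    return best

# In "abcabcabc", the text at position 3 can be replaced by a reference
# meaning "copy 6 characters starting 3 back":
print(find_backreference("abcabcabc", 3))  # → (3, 6)
```

A real compressor emits such (distance, length) pairs in place of the matched text, which is where most of LZ77's size reduction comes from.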

History

Data compression has played a significant role in computing only since the 1970s, when the Internet was becoming more popular and the Lempel-Ziv algorithms were invented, but it has a much longer history outside of computing. Morse code, invented in 1838, is the earliest instance of data compression: the most common letters in the English language, such as “e” and “t”, are given the shortest Morse codes. Later, as mainframe computers were starting to take hold in 1949, Claude Shannon and Robert Fano devised a systematic method to assign codewords based on the probabilities of blocks of data [1]. Two years later, David Huffman was studying information theory at MIT and took a class taught by Robert Fano. Fano gave the class the choice of writing a term paper or taking a final exam. Huffman chose the term paper, which was to be on finding the most efficient method of binary coding. After working for months and failing to come up with anything, Huffman was about to throw away all his work and start studying for the final exam instead. It was at that point that he had an epiphany, discovering a technique very similar to, yet more efficient than, Shannon-Fano coding. The key difference between the two is that in Shannon-Fano coding the probability tree is built top-down, which can produce a suboptimal result, while in Huffman coding it is built bottom-up, which always yields an optimal code [2].
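Huffman's bottom-up construction is short enough to sketch directly. The following is an illustrative Python implementation (not any particular production codec): it repeatedly merges the two least-frequent subtrees, then reads the codes off the finished tree, so that frequent symbols end up near the root with short codes:

```python
import heapq
from collections import Counter

def huffman_codes(data: str) -> dict:
    """Build a Huffman code table bottom-up: repeatedly merge the two
    least-frequent subtrees, then read codes off the finished tree."""
    freq = Counter(data)
    # Heap entries: (frequency, tie-breaker, tree). A tree is either a
    # symbol (leaf) or a (left, right) pair of subtrees.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # single-symbol edge case
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("this is an example of a huffman tree")
# The space, the most frequent symbol, gets a shorter code than the
# rare 'x'; no code is a prefix of another, so decoding is unambiguous.
```

Shannon-Fano coding instead splits the sorted symbol set top-down into roughly equal-probability halves, which is why it can miss the optimum that Huffman's bottom-up merging guarantees.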


More to come in the next two days.

Overview

I'm a student at San Jose State University doing a semester project on the history of lossless data compression algorithms. You can see the course description here. This page will be updated as the semester progresses. I plan to include information about the evolution of Lempel-Ziv compression algorithms, as well as other alternative formats such as DEFLATE and DEFLATE64.


Questions I plan to answer

How have compression algorithms evolved over time?
What methods do they use to compress and how are they interrelated?
What real-world applications do they have, and how have they changed?
What priority do they place on speed vs. compression ratio?
What effect have software patents had on the development of compression algorithms?

Planned Algorithms to Cover

LZ77 (Lempel-Ziv #1) 1977
--LZSS (Lempel-Ziv-Storer-Szymanski) 1982
--LZRW (Lempel-Ziv Ross Williams) 1991
--LZJB (Lempel-Ziv Jeff Bonwick) 1998
--LZR (Lempel-Ziv-Renau) ???
----DEFLATE (ZIP, PNG, gzip, etc.) 1993
------DEFLATE64 (Proprietary) 1999
--LZS (Lempel-Ziv-Stac) 1994
--LZX (Used in Microsoft .CAB) 1995
--LZP (Usually a preprocessor) 1995
--LZO (Lempel-Ziv-Oberhumer) 1996
--LZMA (Lempel-Ziv-Markov Chain) 1998


LZ78 (Lempel-Ziv #2) 1978
--LZW (Lempel-Ziv-Welch) 1984
-----LZMW (Syllable-based LZW) 1985
-----LZAP 1988
-----LZWL 2006


bzip2 (Algorithm Amalgam) 1996


PAQ* (Context Mixing) 2002
---PAQ8O (Faster, higher CR) 2007
---Many others


Statistical LZ
ROLZ (Reduced Offset LZ)

Methods used in these algorithms

Dictionary Compression (ubiquitous)
Prediction by Partial Matching (PPM, used by 7-Zip and RAR)
Context Mixing (Combines multiple statistical models)
Entropy Encoding
--Huffman Coding (Used in DEFLATE, bzip2, SSH, etc.)
-----Adaptive Huffman Coding (Dynamic scope)
-----Shannon-Fano Coding (Suboptimal compared to Huffman)
--Arithmetic & Range Encoding
--Rice & Golomb Coding (FLAC uses Rice coding; not used for general-purpose compression)
--Universal Coding (Slower than Huffman, uncommon)
Run-Length Encoding (AAAAAAABBBBA => 7A4BA)
Burrows-Wheeler Transform (Preprocessor for Run-length Encoding)
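The run-length encoding example above (AAAAAAABBBBA => 7A4BA) can be reproduced with a short sketch. This is an illustrative Python function, assuming the convention that a run of length 1 is written without a count:

```python
def rle_encode(s: str) -> str:
    """Run-length encode: a run of k > 1 identical characters becomes
    '<k><char>'; lone characters are emitted as-is, so 'AAAAAAABBBBA'
    becomes '7A4BA'."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1  # extend the current run
        run = j - i
        out.append(f"{run}{s[i]}" if run > 1 else s[i])
        i = j
    return "".join(out)

print(rle_encode("AAAAAAABBBBA"))  # → 7A4BA
```

RLE on its own only helps when the input has long runs, which is why it is usually paired with a preprocessor such as the Burrows-Wheeler Transform, as in bzip2.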