Data compression is the process of reducing the size of a file by re-encoding its data to use fewer bits of storage than the original. A fundamental aspect of data compression is that the compressed file can be transferred or stored and later recreated for use through a process called decompression.
A Brief History of Data Compression
As the Internet emerged in the 1970s, the relationship between file size and transfer speed became much more apparent. Mathematicians around the world worked on the problem for years, but it wasn’t until the Lempel-Ziv-Welch (LZW) universal lossless compression algorithm was published in 1984 that real benefits were widely realized. LZW was one of the first widely used data compression methods implemented on computers, and it is still used today in various iterations. A large English text file can typically be compressed to about half its original size with LZW.
The underlying idea is older than computing. Morse code, invented in 1838, is often cited as one of the earliest examples of data compression, in that the most common letters in the English language, such as “e” and “t”, are given the shortest Morse codes.
Types of Data Compression
Today, there are many different algorithms and implementations that allow the everyday user to compress files, but some are better suited to certain applications than others. To better understand data compression in general, it’s easiest to break the process down into two main groups: lossy compression and lossless compression.
Lossy compression reduces file size by removing unnecessary bits of information. This type of compression is most commonly used on image, video, and audio files, where a perfect representation of the source media is not required.
For example, an MP3 audio file doesn’t contain all the audio information from the original recording. Instead, MP3 lossy compression discards sounds that most listeners are unlikely to perceive. Since the average human ear would not notice the difference, the result is a smaller file with minimal user impact.
The downside? The more heavily a file is compressed with lossy compression, the more noticeable the reduction in quality becomes. Also, lossy compression does not work well with files where all of the data is crucial (for example, compressing a spreadsheet would yield unusable results).
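The trade-off between size and quality can be illustrated with a minimal sketch. It is not how MP3 or any real codec works; it assumes a made-up "recording" (one cycle of a sine wave) and a hypothetical `quantize` helper that snaps each sample to the nearest multiple of a step size. A larger step leaves fewer distinct values to store, but the result drifts further from the original.

```python
import math

def quantize(samples, step):
    """Snap each sample to the nearest multiple of `step` (a lossy operation)."""
    return [(s + step // 2) // step * step for s in samples]

# Hypothetical "recording": one cycle of a sine wave as 50 integer samples.
samples = [round(100 * math.sin(2 * math.pi * i / 50)) for i in range(50)]

def max_error(step):
    """Largest difference between an original sample and its quantized value."""
    return max(abs(a - b) for a, b in zip(samples, quantize(samples, step)))

def distinct_levels(step):
    """Number of different values left after quantizing (a rough proxy for size)."""
    return len(set(quantize(samples, step)))

for step in (1, 8, 32):
    print(f"step={step:2d}  distinct values={distinct_levels(step):2d}  "
          f"max error={max_error(step)}")
```

With a step of 1 nothing is lost. As the step grows, fewer values need to be stored and the error increases, which mirrors the quality loss described above.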
Lossless compression reduces file size without removing any information. Instead, this method works by encoding redundancies within the data more efficiently to reduce the overall file size. With lossless compression, it is possible to perfectly reconstruct the original file.
For example, ZIP, one of the most common lossless compression formats, is often used for program files in Windows because it preserves all the original information. Decompressing (unzipping) the file produces a working executable program, whereas a program compressed with a lossy method would be unusable.
Common lossless formats include PNG for images, FLAC for audio, and ZIP for general files. Lossless formats for video are rarely used for distribution, as the files would take up massive amounts of space.
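A lossless round trip can be sketched with Python's standard `zlib` module, which implements DEFLATE, the method most commonly used inside ZIP archives. The sample text here is an assumption chosen because it is highly repetitive. Real files will compress by different amounts.

```python
import zlib

# Hypothetical input: a short sentence repeated many times (very redundant).
original = b"the quick brown fox jumps over the lazy dog. " * 200

compressed = zlib.compress(original)    # lossless compression
restored = zlib.decompress(compressed)  # decompression

print(len(original), "bytes before,", len(compressed), "bytes after")
print("identical after round trip:", restored == original)
```

The restored bytes match the original exactly, which is the property that makes lossless compression safe for programs, documents, and spreadsheets.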
Limitations of Data Compression
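It’s important to note that a file cannot be compressed indefinitely. Compressing a file into a ZIP may reduce its size, but compressing the result again yields little or no further reduction, because most of the redundancy has already been removed. It is impossible to keep compressing a file until its size is reduced to nothing.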
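This limit can be observed with a small experiment. The sketch below uses Python's standard `zlib` module and assumes a made-up input: 20,000 bytes drawn at random from a 16-character alphabet with a fixed seed, so each byte carries less information than it could. The first pass shrinks the data. The second pass has almost nothing left to remove.

```python
import random
import zlib

# Hypothetical input: random bytes drawn from only 16 distinct characters.
rng = random.Random(0)
data = bytes(rng.choice(b"ACGTacgtNn-_.,;:") for _ in range(20000))

once = zlib.compress(data, 9)   # first pass removes most of the redundancy
twice = zlib.compress(once, 9)  # second pass works on already-compressed data

print("original:        ", len(data), "bytes")
print("compressed once: ", len(once), "bytes")
print("compressed twice:", len(twice), "bytes")
```

On inputs like this, the second result is typically no smaller than the first, and it can even be slightly larger because of the format's own overhead.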
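It’s also important to understand which conversions between the two groups of data compression make sense:
- Yes: Converting lossless files to lossy files (quality is reduced once, in exchange for a smaller file)
- Yes: Converting one lossless format to another lossless format (no data is lost in either direction)
- No: Converting lossy files to lossless files (lossy formats throw out data; it’s impossible to recover that data, so the file only gets larger without gaining quality)
- No: Converting one lossy format to another lossy format (each lossy encoding discards more data, so quality degrades with every conversion)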
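The last two rules above can be illustrated with a toy model. This is a sketch under stated assumptions, not a real codec: `lossy_encode` is a hypothetical helper that rounds each value to the nearest multiple of a step size, the two step sizes stand in for two different lossy formats, and `zlib` stands in for a lossless format.

```python
import zlib

def lossy_encode(samples, step):
    """Toy lossy format: round each value to the nearest multiple of `step`."""
    return [(s + step // 2) // step * step for s in samples]

def total_error(reference, candidate):
    """Sum of absolute differences between two sample lists."""
    return sum(abs(a - b) for a, b in zip(reference, candidate))

original = [3, 14, 27, 38, 52, 61]

# Lossy -> lossless: the lossless step preserves the lossy data exactly,
# but it cannot bring back what the lossy step already threw away.
lossy_a = lossy_encode(original, 10)
unpacked = list(zlib.decompress(zlib.compress(bytes(lossy_a))))
print("lossless copy equals lossy file:", unpacked == lossy_a)
print("lossless copy equals original:  ", unpacked == original)

# Lossy -> lossy: re-encoding the lossy file with a second lossy "format"
# ends up further from the original than encoding the original directly.
transcoded = lossy_encode(lossy_a, 7)
direct = lossy_encode(original, 7)
print("error when transcoding:", total_error(original, transcoded))
print("error when encoding the original directly:", total_error(original, direct))
```

For these sample values, the transcoded result is further from the original than the direct encoding. This is why it is better to convert to a lossy format from a lossless source whenever one is available.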
A Final Word on Data Compression
How does data compression work from a technical standpoint? The actual algorithms that decide what data gets thrown out (in lossy compression) and how best to encode redundant data (in lossless compression) are extremely complicated. This article is meant to serve as a high-level overview of the basics and to provide context for applying these practices in real-world situations.