Lossless Compression: Techniques and Algorithms
What Lossless Compression Actually Means
Lossless compression shrinks files without destroying any data. When you decompress, you get exactly what you started with. No quality loss. No artifacts. No approximations.
This isn't magic. The algorithm finds patterns, redundancies, and inefficiencies in the original data and encodes them more efficiently. The goal is simple: smaller file, identical content.
You need this when every bit matters—source code, spreadsheets, database dumps, executable files, and any data where losing information would break functionality.
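The round trip described above can be checked directly. This is a minimal sketch using Python's standard `zlib` module; the sample SQL text is invented for illustration.

```python
import zlib

# Hypothetical sample: a highly repetitive database-dump fragment.
original = b"SELECT id, name FROM users WHERE active = 1;\n" * 200

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "bytes before,", len(compressed), "bytes after")
print("round trip is exact:", restored == original)
```

The equality check is the whole point: a lossless codec must return every byte unchanged, however small the compressed form is.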
How Lossless Compression Works
Every compression algorithm relies on one core principle: redundancy elimination. Data, whether it's text, images, or audio, contains patterns. The algorithm identifies these patterns and replaces them with shorter representations.
There are two main approaches:
- Statistical methods — Assign shorter codes to more frequent symbols. Huffman coding is the classic example.
- Dictionary methods — Build a dictionary of repeated sequences and replace them with references. LZ77 and LZ78 pioneered this approach, and LZW is a widely used refinement of LZ78.
Most modern algorithms combine both techniques. DEFLATE, used in ZIP and PNG, layers LZ77 (dictionary-based) with Huffman coding (statistical).
Common Lossless Compression Algorithms
Huffman Coding
This algorithm assigns variable-length codes to symbols based on their frequency. Common symbols get short codes. Rare symbols get longer codes.
Example: In English text, the letter "E" appears far more often than "Z". A Huffman code built from those frequencies might give "E" a 3-bit code and "Z" a 12-bit code. The exact lengths depend on the text, but on average the result is smaller than using fixed 8-bit codes for every character.
Huffman coding rarely stands alone. It's usually paired with other methods as the final encoding step.
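To make the idea concrete, here is a small sketch that builds a Huffman code table from symbol frequencies. The function name `huffman_codes` and the sample sentence are illustrative choices, and the code only produces the bit strings; it does not pack them into bytes or store the table a decoder would need.

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    """Return a prefix-free bit string for each symbol in a non-empty text."""
    if not text:
        raise ValueError("text must not be empty")
    freq = Counter(text)
    if len(freq) == 1:  # degenerate case: only one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; their symbols gain one more bit.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

sample = "this sentence has plenty of e's: see, eee, eeeee"
codes = huffman_codes(sample)
encoded_bits = sum(len(codes[ch]) for ch in sample)

print("code for 'e':", codes["e"])
print("encoded size:", encoded_bits, "bits vs", 8 * len(sample), "bits fixed-width")
```

In this sample "e" is the most frequent symbol, so it ends up with one of the shortest codes, while symbols that appear once get the longest ones.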
Lempel-Ziv-Welch (LZW)
LZW builds a dictionary of strings as it reads data. When it encounters a sequence it has seen before, it outputs a reference to that dictionary entry instead of the raw characters.
This works incredibly well on repetitive data. A text file with the word "compression" appearing 500 times? LZW will compress that drastically.
You'll find LZW in GIF images and the original UNIX compress utility. It's fast and effective, but the dictionary can grow large on diverse data.
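The dictionary-building loop can be sketched in a few lines. This version is a teaching sketch with assumptions: it lets the dictionary grow without limit and emits plain integers, whereas real formats such as GIF cap the code width and pack codes into bits.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Emit one dictionary code per longest already-seen sequence."""
    dictionary = {bytes([i]): i for i in range(256)}
    current = b""
    output = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in dictionary:
            current = candidate  # keep extending the match
        else:
            output.append(dictionary[current])
            dictionary[candidate] = len(dictionary)  # learn the new sequence
            current = bytes([value])
    if current:
        output.append(dictionary[current])
    return output

def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the same dictionary on the fly while decoding."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    previous = dictionary[codes[0]]
    result = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            # Code refers to the entry being defined right now.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        result.append(entry)
        dictionary[len(dictionary)] = previous + entry[:1]
        previous = entry
    return b"".join(result)

text = b"compression " * 500
lzw_codes = lzw_compress(text)
print(len(text), "bytes in,", len(lzw_codes), "codes out")
```

Note that the decoder never receives the dictionary. It reconstructs it from the code stream, which is why LZW needs no side table.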
DEFLATE
DEFLATE is the workhorse of lossless compression. It combines two techniques:
- LZ77 — Finds repeated sequences and replaces them with back-references (distance + length)
- Huffman coding — Encodes the resulting symbols with optimal variable-length codes
This combination gives you the pattern-matching power of dictionary methods with the statistical efficiency of Huffman coding. ZIP files, PNG images, gzip, and HTTP compression all use DEFLATE.
It's not the most aggressive compressor, but it offers a good balance between compression ratio and speed.
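The LZ77 half of DEFLATE can be illustrated with a toy parser that turns input into literals and (distance, length) pairs. This is a sketch, not DEFLATE itself: it uses a brute-force search where real encoders use hash chains, and it skips the Huffman stage entirely. The window size and match-length limits mirror DEFLATE's 32 KiB window and 3 to 258 byte matches.

```python
def lz77_tokens(data: bytes, window: int = 32768,
                min_match: int = 3, max_match: int = 258) -> list:
    """Greedy parse into literal byte values and (distance, length) pairs."""
    tokens = []
    pos = 0
    while pos < len(data):
        best_len, best_dist = 0, 0
        for start in range(max(0, pos - window), pos):
            length = 0
            while (length < max_match and pos + length < len(data)
                   and data[start + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - start
        if best_len >= min_match:
            tokens.append((best_dist, best_len))  # back-reference
            pos += best_len
        else:
            tokens.append(data[pos])  # literal byte
            pos += 1
    return tokens

def lz77_expand(tokens: list) -> bytes:
    """Replay literals and back-references to rebuild the input."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])  # byte by byte, so overlaps work
        else:
            out.append(token)
    return bytes(out)

tokens = lz77_tokens(b"abcabcabcabcX")
print(tokens)  # [97, 98, 99, (3, 9), 88]
```

The single pair `(3, 9)` means "go back 3 bytes and copy 9", which reaches past the data available when the copy starts. That overlap is legal and is how LZ77 encodes runs compactly.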
Arithmetic Coding
Arithmetic coding represents entire messages as a single number within a range. Instead of assigning codes to individual symbols, it encodes the entire stream as one fractional value between 0 and 1.
This approach gets closer to the theoretical compression limit than Huffman coding. It can effectively spend a fractional number of bits on a symbol, whereas Huffman coding must assign every symbol a whole number of bits.
The tradeoff: arithmetic coding is slower and more complex. It's used in JPEG 2000, H.264, and H.265 video compression, but rarely in everyday file formats.
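The interval-narrowing idea can be sketched with exact rational arithmetic. This is an illustration under simplifying assumptions: it uses Python's `Fraction` for unlimited precision and passes the message length to the decoder, whereas production coders use fixed-width integers with renormalization and adaptive models. The function names are illustrative.

```python
from collections import Counter
from fractions import Fraction

def build_ranges(message: str) -> dict[str, tuple[Fraction, Fraction]]:
    """Give each symbol a slice of [0, 1) proportional to its frequency."""
    counts = Counter(message)
    total = len(message)
    ranges, low = {}, Fraction(0)
    for symbol in sorted(counts):
        high = low + Fraction(counts[symbol], total)
        ranges[symbol] = (low, high)
        low = high
    return ranges

def arithmetic_encode(message: str, ranges) -> Fraction:
    """Narrow [0, 1) once per symbol and return a number inside the result."""
    low, width = Fraction(0), Fraction(1)
    for symbol in message:
        sym_low, sym_high = ranges[symbol]
        low = low + width * sym_low
        width = width * (sym_high - sym_low)
    return low + width / 2  # any value inside the final interval works

def arithmetic_decode(value: Fraction, ranges, length: int) -> str:
    """Find which slice holds the value, emit its symbol, rescale, repeat."""
    out = []
    for _ in range(length):
        for symbol, (sym_low, sym_high) in ranges.items():
            if sym_low <= value < sym_high:
                out.append(symbol)
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return "".join(out)

message = "abracadabra"
ranges = build_ranges(message)
code_value = arithmetic_encode(message, ranges)
print(float(code_value), arithmetic_decode(code_value, ranges, len(message)))
```

Frequent symbols own wide slices, so they shrink the interval less and cost fewer bits. That is how the coder avoids Huffman's whole-bit granularity.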
Brotli
Brotli is a compression format Google released in 2015, designed primarily for the web. It combines LZ77, Huffman coding, and context modeling, so it is essentially a more sophisticated relative of DEFLATE.
Brotli typically achieves around 15-25% better compression than DEFLATE/gzip on text-based content such as HTML, CSS, and JavaScript, though the gain depends on the content and the compression level. All major browsers support it, but only over HTTPS, and gzip remains the common fallback.
Lossless Compression File Formats
Different file types call for different approaches. Here are the main formats and what uses them:
- ZIP, GZIP, 7z — General-purpose formats using DEFLATE or other algorithms; ZIP and 7z are archives that hold many files, while gzip compresses a single stream
- PNG — Image format using DEFLATE; supports transparency without quality loss
- GIF — Image format using LZW; limited to 256 colors
- FLAC — Audio format using linear prediction; CD-quality audio at roughly 60% of the original size
- ALAC — Apple's lossless audio codec
- WebP — Google's image format supporting both lossy and lossless modes
- AVIF — Modern image format with lossless support using AV1 compression
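The "linear prediction" idea behind FLAC can be sketched as follows: predict each sample from the previous ones and store only the prediction error. This sketch assumes a single fixed second-order predictor and a synthetic 440 Hz tone. A real FLAC encoder also selects among predictors and entropy-codes the residuals, which is not shown here.

```python
import math

def fixed_predictor_residuals(samples: list[int]) -> list[int]:
    """Keep two warm-up samples, then store each sample's prediction error."""
    residuals = list(samples[:2])
    for n in range(2, len(samples)):
        predicted = 2 * samples[n - 1] - samples[n - 2]  # straight-line guess
        residuals.append(samples[n] - predicted)
    return residuals

def restore_samples(residuals: list[int]) -> list[int]:
    """Invert the predictor exactly, since everything is integer arithmetic."""
    samples = list(residuals[:2])
    for n in range(2, len(residuals)):
        predicted = 2 * samples[n - 1] - samples[n - 2]
        samples.append(residuals[n] + predicted)
    return samples

# Hypothetical input: a 440 Hz sine tone sampled at 44.1 kHz.
tone = [round(20000 * math.sin(2 * math.pi * 440 * n / 44100)) for n in range(2000)]
residuals = fixed_predictor_residuals(tone)

print("largest sample:  ", max(abs(s) for s in tone))
print("largest residual:", max(abs(r) for r in residuals[2:]))
```

The residuals are far smaller than the samples, so they need fewer bits, and the decoder recovers the signal exactly by running the same predictor.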
When to Use Lossless Compression
Lossless isn't always the right choice. Here's when it makes sense:
- Data integrity is non-negotiable — Source code, executables, databases, archives
- Multiple compression cycles — If a file will be decompressed and recompressed repeatedly, each lossy pass degrades it a little more, while lossless passes always return the exact original
- Editing requirements — Lossless images can be edited and recompressed without accumulating artifacts
- Professional audio work — When you need the exact original signal for mixing or processing
And when to skip it:
- Final delivery of photos for web — JPEG at around 85% quality is visually indistinguishable from the original for most viewers, at a fraction of the size
- Streaming video — Lossy codecs such as H.264 and H.265 (HEVC) are standard because lossless video would be impractically large
- Maximum compression needs — If file size matters more than perfect reconstruction, lossy gets better ratios
Lossless Compression Tools Compared
Here's how the common tools stack up:
| Tool | Algorithm | Compression Ratio | Speed | Best Use Case |
|---|---|---|---|---|
| 7-Zip | LZMA/LZMA2 | Excellent | Slow | Maximum compression for archives |
| gzip | DEFLATE | Good | Fast | Server-side web compression, logs |
| bzip2 | Burrows-Wheeler | Better than gzip | Medium | Text files, source code |
| xz | LZMA2 | Excellent | Slow | Distribution packages, backups |
| zstd | Zstandard | Excellent | Fast | Real-time compression, databases |
| brotli | Brotli | Better than gzip | Medium | Web content delivery |
| pngquant | Palette quantization (lossy) + PNG | Good | Fast | Shrinking PNGs when some color loss is acceptable; not lossless |
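zstd (Zstandard), developed at Facebook, is worth highlighting. At compression ratios comparable to DEFLATE it typically runs several times faster (figures of 3-5x are commonly cited), and its higher levels trade speed for noticeably better ratios. It is used by the Linux kernel and Apache Cassandra, among others.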
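Ratings like these are easiest to trust when measured on your own data. The sketch below compares only the codecs in Python's standard library (`zlib` for DEFLATE, `bz2` for Burrows-Wheeler, `lzma` for LZMA2 in an .xz container). zstd and Brotli are left out because they need third-party packages on most Python versions, and the log-like sample data is invented.

```python
import bz2
import lzma
import time
import zlib

def benchmark(data: bytes) -> dict[str, tuple[int, float]]:
    """Return {codec name: (compressed size in bytes, seconds to compress)}."""
    codecs = {
        "zlib (DEFLATE)": (zlib.compress, zlib.decompress),
        "bz2 (Burrows-Wheeler)": (bz2.compress, bz2.decompress),
        "lzma (LZMA2)": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        started = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - started
        if decompress(packed) != data:  # sanity check: must be lossless
            raise RuntimeError(f"{name} failed the round trip")
        results[name] = (len(packed), elapsed)
    return results

# Hypothetical sample: synthetic log lines.
log_data = b"".join(
    f"2024-01-01 12:00:{i % 60:02d} INFO request {i} served\n".encode()
    for i in range(5000)
)
results = benchmark(log_data)
for name, (size, seconds) in results.items():
    print(f"{name:24s} {size:8d} bytes  {seconds * 1000:7.2f} ms")
```

Timings vary by machine and each codec runs at its default level here, so treat the output as a rough comparison rather than a ranking.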
Getting Started with Lossless Compression
Compressing Files on the Command Line
gzip (Unix/Linux/macOS):
gzip filename.txt # compress
gunzip filename.txt.gz # decompress
gzip -k filename.txt # keep original
zip (cross-platform):
zip archive.zip file1.txt file2.txt
zip -r archive.zip folder/ # recursive
7-Zip:
7z a archive.7z files/ # create archive
7z x archive.7z # extract
zstd:
zstd filename.txt # compress
unzstd filename.txt.zst # decompress
zstd -19 filename.txt # level 19 compression (slower, smaller)
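When the compression step lives inside a script rather than a shell, the same gzip behavior is available from Python's standard library. This is a minimal sketch; the helper names `gzip_file` and `gunzip_bytes` are illustrative, and the demo writes to a temporary directory so it touches no real files.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def gzip_file(source: Path) -> Path:
    """Write source + '.gz' next to the original and keep the original."""
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine
    return target

def gunzip_bytes(path: Path) -> bytes:
    """Read a .gz file and return the decompressed bytes."""
    with gzip.open(path, "rb") as handle:
        return handle.read()

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "filename.txt"
    source.write_bytes(b"line of log output\n" * 1000)
    archive = gzip_file(source)
    sizes = (source.stat().st_size, archive.stat().st_size)
    round_trip_ok = gunzip_bytes(archive) == source.read_bytes()

print("original/compressed bytes:", sizes, "round trip ok:", round_trip_ok)
```

Like `gzip -k`, this leaves the original file in place.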
Compressing Images Without Quality Loss
For PNG images, use pngcrush or optipng:
optipng -o7 image.png # maximum optimization
You can convert a JPEG to PNG when you need a lossless working copy, but be warned: the conversion cannot restore detail the JPEG already discarded, and it usually makes the file larger, since PNG doesn't handle photographic data efficiently.
The Reality of Compression Limits
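No algorithm can compress every input. A counting argument shows why: there are fewer short files than long ones, so any lossless scheme that shrinks some inputs must enlarge others. Pure noise is the worst case. Random data has no patterns to exploit, so compressing it typically makes the file slightly larger because of format overhead.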
Compressibility depends on:
- Entropy — Lower entropy (more predictable data) compresses better
- Redundancy — Repeated patterns are gold for compression
- Data type — Plain text often shrinks by 60-70%. Already-compressed data such as JPEG images usually shrinks by only a few percent, if at all.
The theoretical limit is the entropy of the source. How close a practical algorithm gets depends on how well its model fits the data, and DEFLATE's behavior is well understood. If you need better ratios, look at context modeling or specialized algorithms for your specific data type.
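Both points can be checked with a short experiment. The sketch below measures order-0 Shannon entropy, which treats each byte as independent (a simplifying assumption), and compresses random and repetitive data with `zlib`. The sample text is invented.

```python
import math
import os
import zlib
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, treating bytes as independent symbols."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

random_bytes = os.urandom(100_000)
text_bytes = b"the quick brown fox jumps over the lazy dog. " * 2000

report = {}
for label, data in (("random", random_bytes), ("text", text_bytes)):
    packed = zlib.compress(data, 9)
    report[label] = (byte_entropy(data), len(data), len(packed))
    print(f"{label:6s} entropy={report[label][0]:.3f} bits/byte "
          f"{len(data)} -> {len(packed)} bytes")
```

The random input sits near 8 bits per byte and does not shrink. The repeated sentence compresses far below what its per-byte entropy suggests, because the order-0 figure ignores repetition across bytes, which is exactly what dictionary methods exploit.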