Excius/file-compression
Huffman Coding File Compression

A high-performance file compression tool implementing a multi-stage compression pipeline: BWT → MTF → RLE → Huffman Coding

Features

  • Multi-Stage Compression Pipeline

    • BWT (Burrows-Wheeler Transform): Rearranges data to cluster similar characters
    • MTF (Move-to-Front): Transforms repeated characters into small values
    • RLE (Run-Length Encoding): Compresses runs of identical bytes
    • Huffman Coding: Entropy encoding as the final compression stage
  • Parallel Processing: OpenMP-based parallelization for improved performance on multi-core systems

  • Sequential Mode: Optimized single-threaded implementation

  • Comprehensive Testing: 18-test suite covering all pipeline components

  • Cross-Platform: Works on Linux, macOS, and other POSIX systems

Performance

Compression Results

Original   Compressed   Compression Ratio
1.0 MB     764 KB       74.6%
10 MB      7.5 MB       74.6%
50 MB      38 MB        74.6%
100 MB     78 MB        74.6%

Notes:

  • Compression ratio varies by data type (random data is hardest to compress)
  • Repetitive/text data achieves better compression (typically 40-60% of original size)
  • All decompressed files verified identical to originals using diff

Parallel Performance Analysis

Performance metrics using the -p flag with different thread counts:

10 MB File

Threads   Time     Speedup   Efficiency   Cost
1 (seq)   0.269s   1.00x     100.0%       0.269
2         0.232s   1.15x     57.5%        0.464
4         0.197s   1.36x     34.0%        0.788
8         0.136s   1.97x     24.6%        1.088
16        0.146s   1.84x     11.5%        2.336

50 MB File

Threads   Time     Speedup   Efficiency   Cost
1 (seq)   1.309s   1.00x     100.0%       1.309
2         1.124s   1.16x     58.0%        2.248
4         0.845s   1.54x     38.5%        3.380
8         0.685s   1.91x     23.8%        5.480
16        0.630s   2.07x     12.9%        10.080

100 MB File

Threads   Time     Speedup   Efficiency   Cost
1 (seq)   2.632s   1.00x     100.0%       2.632
2         2.000s   1.31x     65.5%        4.000
4         1.587s   1.65x     41.2%        6.348
8         1.265s   2.08x     26.0%        10.120
16        1.260s   2.08x     13.0%        20.160

500 MB File

Threads   Time      Speedup   Efficiency   Cost
1 (seq)   13.372s   1.00x     100.0%       13.372
2         10.284s   1.30x     65.0%        20.568
4         7.902s    1.69x     42.3%        31.608
8         6.230s    2.15x     26.8%        49.840
16        6.075s    2.20x     13.8%        97.200

1 GB File

Threads   Time      Speedup   Efficiency   Cost
1 (seq)   27.496s   1.00x     100.0%       27.496
2         27.427s   1.00x     50.1%        54.854
4         17.643s   1.56x     39.0%        70.572
8         12.627s   2.18x     27.2%        101.016
16        12.714s   2.16x     13.5%        203.424

Performance Metrics:

  • Speedup = Sequential Time / Parallel Time
  • Efficiency = (Speedup / Threads) × 100%
  • Cost = Threads × Parallel Time

Thread Scaling Recommendations

  • Large files (≥100MB) with random/complex data: Use 8-16 threads for the best speedup (~2.1-2.2x)
  • Files 10-50MB: Use 8 threads for optimal balance (~1.91-1.97x speedup)
  • Files <10MB or highly repetitive data: Use sequential mode (parallel overhead not worth it)
  • Best efficiency: 2 threads (up to ~65% efficiency)
  • Best speedup: 16 threads on large random files (2.2x max)

Building

Requirements

  • GCC or Clang with C11 support
  • OpenMP support (for parallel version)
  • Make

Build Commands

# Build all (parallel and sequential versions)
make

# Build only parallel version
make huffcode

# Build only sequential version
make huffcode_seq

# Build test suite
make test_pipeline

# Clean build artifacts
make clean

Usage

Compression

# Compress a file (sequential mode)
./huffcode -i input.txt -o output.huff

# Compress with parallel mode
./huffcode -p -i input.txt -o output.huff

# Compress with specific thread count (parallel mode)
export OMP_NUM_THREADS=8
./huffcode -p -i large_file.txt -o compressed.huff

# Use sequential version (always single-threaded, no -p flag)
./huffcode_seq -i input.txt -o output.huff

Decompression

# Decompress a file (sequential mode)
./huffcode -d -i compressed.huff -o decompressed.txt

# Decompress with parallel mode
./huffcode -p -d -i compressed.huff -o decompressed.txt

# Decompress with sequential version
./huffcode_seq -d -i compressed.huff -o decompressed.txt

Options

  • -i <file>: Input file path (required)
  • -o <file>: Output file path (required)
  • -d: Decompress mode (default is compress)
  • -p: Use parallel version (OpenMP multi-threading)

Environment Variables

  • OMP_NUM_THREADS: Set number of threads when using -p flag (default: system max)
    export OMP_NUM_THREADS=8  # Use 8 threads with -p flag
    ./huffcode -p -i large_file.txt -o compressed.huff

Testing

Run All Tests

# Run complete test suite (sequential, parallel, and pipeline tests)
make check

This runs:

  1. Sequential huffcode tests (basic compression/decompression)
  2. Parallel huffcode tests (multi-threaded compression)
  3. Pipeline component tests (18 tests covering BWT, MTF, RLE, Huffman, and integration)

Run Specific Test Suites

# Sequential tests only
make check_seq

# Parallel tests only
make check_parallel

# Pipeline tests only (BWT, MTF, RLE, Huffman integration)
make check_pipeline

# Or run pipeline tests directly
./test_pipeline

Test Coverage

The test_pipeline suite includes:

  • BWT Tests (4): Simple strings, repetitive patterns, edge cases
  • MTF Tests (3): Character patterns, binary data
  • RLE Tests (3): Run compression, no-run data, long runs
  • Pipeline Integration (1): Full BWT→MTF→RLE→Huffman pipeline
  • Huffman Tests (5): Small text, repetitive data, binary data, edge cases
  • Edge Cases (2): Two-symbol data, medium text (10KB)

Valgrind Memory Checks

# Run with Valgrind (checks for memory leaks)
make valgrind_check

Compression Pipeline Details

How It Works

  1. BWT (Burrows-Wheeler Transform)

    • Rearranges input data using lexicographic rotation sorting
    • Clusters similar characters together
    • Reversible with stored index position
    • Time: O(n² log n), Space: O(n)
  2. MTF (Move-to-Front)

    • Maintains a list of all 256 possible byte values
    • When a byte is encountered, output its position and move it to front
    • Repeated bytes get small position values
    • Time: O(n), Space: O(1)
  3. RLE (Run-Length Encoding)

    • Compresses runs of 3+ identical bytes
    • Stores original length in 4-byte header
    • Preserves incompressible data as-is
    • Time: O(n), Space: O(n)
  4. Huffman Coding

    • Builds frequency table and optimal prefix-free code tree
    • Encodes data using variable-length codes
    • Parallel implementation for frequency counting
    • Time: O(n log n), Space: O(n)

File Format

Compressed files contain:

  1. Original file size (4 bytes)
  2. BWT transform index (4 bytes)
  3. RLE-encoded data length (4 bytes)
  4. Huffman tree structure
  5. Compressed bit stream

Project Structure

.
├── huffcode.c              # Main command-line interface
├── huffman.c               # Huffman coding + pipeline integration
├── huffman_parallel.c      # Parallel Huffman implementation (OpenMP)
├── bwt.c / bwt.h          # Burrows-Wheeler Transform
├── mtf.c / mtf.h          # Move-to-Front encoding
├── rle.c / rle.h          # Run-Length Encoding
├── test_pipeline.c         # Comprehensive test suite (18 tests)
├── run_tests.sh            # Basic compression test script
├── Makefile                # Build system
└── test_data/              # Test files (1MB-100MB)

Algorithm Complexity

Algorithm   Encoding Time   Encoding Space   Decoding Time   Decoding Space
BWT         O(n² log n)     O(n)             O(n)            O(n)
MTF         O(n)            O(1)             O(n)            O(1)
RLE         O(n)            O(n)             O(n)            O(n)
Huffman     O(n log n)      O(n)             O(n)            O(n)
Total       O(n² log n)     O(n)             O(n)            O(n)

Note: BWT dominates encoding complexity; decoding is linear time.

Known Limitations

  • BWT encoding is O(n² log n) time complexity - slower on very large files
  • Random data compresses poorly (~75% ratio) due to high entropy
  • Files <1MB show minimal benefit from parallelization
  • Maximum file size limited by available memory (BWT requires ~2x file size in RAM)

Examples

Compress a Text File

$ ./huffcode -i document.txt -o document.huff
$ ls -lh document.*
-rw-r--r-- 1 user group  100K  document.txt
-rw-r--r-- 1 user group   45K  document.huff    # ~45% of original

Compress with 8 Threads

$ export OMP_NUM_THREADS=8
$ ./huffcode -p -i large_file.txt -o large_file.huff

Decompress and Verify

$ ./huffcode -d -i document.huff -o restored.txt
$ diff document.txt restored.txt
$ echo $?
0    # Files are identical!

Benchmark Performance

# Sequential compression
$ time ./huffcode -i test_data/random_10MB.txt -o /tmp/test.huff
real    0m0.269s
user    0m0.265s
sys     0m0.004s

# Parallel compression (8 threads)
$ export OMP_NUM_THREADS=8
$ time ./huffcode -p -i test_data/random_10MB.txt -o /tmp/test.huff
real    0m0.136s
user    0m0.856s
sys     0m0.008s

# Speedup: 1.97x faster!

License

See LICENSE file for details.

Contributing

Contributions welcome! Areas for improvement:

  • Faster BWT implementation (suffix array construction)
  • Block-based processing for better memory efficiency
  • Additional compression algorithms (LZ77, LZMA)
  • Improved parallel scaling for small files

References

  • Burrows, M. and Wheeler, D. J. (1994), A block-sorting lossless data compression algorithm
  • Huffman, D. A. (1952), A Method for the Construction of Minimum-Redundancy Codes
  • Bentley, J. L., et al. (1986), A locally adaptive data compression scheme
