Excius/file-compression
Huffman Coding File Compression

A high-performance file compression tool implementing a multi-stage compression pipeline: BWT → MTF → RLE → Huffman Coding

Features

  • Multi-Stage Compression Pipeline

    • BWT (Burrows-Wheeler Transform): Rearranges data to cluster similar characters
    • MTF (Move-to-Front): Transforms repeated characters into small values
    • RLE (Run-Length Encoding): Compresses runs of identical bytes
    • Huffman Coding: Entropy encoding as the final compression stage
  • Parallel Processing: OpenMP-based parallelization for improved performance on multi-core systems

  • Sequential Mode: Optimized single-threaded implementation

  • Comprehensive Testing: 18-test suite covering all pipeline components

  • Cross-Platform: Works on Linux, macOS, and other POSIX systems

Performance

Compression Results

Original   Compressed   Compression Ratio
1.0 MB     764 KB       74.6%
10 MB      7.5 MB       74.6%
50 MB      38 MB        74.6%
100 MB     78 MB        74.6%

Notes:

  • Compression ratio varies by data type (random data is hardest to compress)
  • Repetitive/text data achieves better compression (typically 40-60% of original size)
  • All decompressed files verified identical to originals using diff

Parallel Performance Analysis

Performance metrics using the -p flag with different thread counts:

10 MB File

Threads   Time     Speedup   Efficiency   Cost
1 (seq)   0.269s   1.00x     100.0%       0.269
2         0.232s   1.15x     57.5%        0.464
4         0.197s   1.36x     34.0%        0.788
8         0.136s   1.97x     24.6%        1.088
16        0.146s   1.84x     11.5%        2.336

50 MB File

Threads   Time     Speedup   Efficiency   Cost
1 (seq)   1.309s   1.00x     100.0%       1.309
2         1.124s   1.16x     58.0%        2.248
4         0.845s   1.54x     38.5%        3.380
8         0.685s   1.91x     23.8%        5.480
16        0.630s   2.07x     12.9%        10.080

100 MB File

Threads   Time     Speedup   Efficiency   Cost
1 (seq)   2.632s   1.00x     100.0%       2.632
2         2.000s   1.31x     65.5%        4.000
4         1.587s   1.65x     41.2%        6.348
8         1.265s   2.08x     26.0%        10.120
16        1.260s   2.08x     13.0%        20.160

500 MB File

Threads   Time      Speedup   Efficiency   Cost
1 (seq)   13.372s   1.00x     100.0%       13.372
2         10.284s   1.30x     65.0%        20.568
4         7.902s    1.69x     42.3%        31.608
8         6.230s    2.15x     26.8%        49.840
16        6.075s    2.20x     13.8%        97.200

1 GB File

Threads   Time      Speedup   Efficiency   Cost
1 (seq)   27.496s   1.00x     100.0%       27.496
2         27.427s   1.00x     50.1%        54.854
4         17.643s   1.56x     39.0%        70.572
8         12.627s   2.18x     27.2%        101.016
16        12.714s   2.16x     13.5%        203.424

Performance Metrics:

  • Speedup = Sequential Time / Parallel Time
  • Efficiency = (Speedup / Threads) × 100%
  • Cost = Threads × Parallel Time

Thread Scaling Recommendations

  • Large files (≥100MB) with random/complex data: Use 8-16 threads for the best speedup (~2.1-2.2x)
  • Files 10-50MB: Use 8 threads for optimal balance (~1.91-1.97x speedup)
  • Files <10MB or highly repetitive data: Use sequential mode (parallel overhead not worth it)
  • Best efficiency: 2 threads (up to ~65% efficiency)
  • Best speedup: 16 threads on large random files (2.2x max)

Building

Requirements

  • GCC or Clang with C11 support
  • OpenMP support (for parallel version)
  • Make

Build Commands

# Build all (parallel and sequential versions)
make

# Build only parallel version
make huffcode

# Build only sequential version
make huffcode_seq

# Build test suite
make test_pipeline

# Clean build artifacts
make clean

Usage

Compression

# Compress a file (sequential mode)
./huffcode -i input.txt -o output.huff

# Compress with parallel mode
./huffcode -p -i input.txt -o output.huff

# Compress with specific thread count (parallel mode)
export OMP_NUM_THREADS=8
./huffcode -p -i large_file.txt -o compressed.huff

# Use sequential version (always single-threaded, no -p flag)
./huffcode_seq -i input.txt -o output.huff

Decompression

# Decompress a file (sequential mode)
./huffcode -d -i compressed.huff -o decompressed.txt

# Decompress with parallel mode
./huffcode -p -d -i compressed.huff -o decompressed.txt

# Decompress with sequential version
./huffcode_seq -d -i compressed.huff -o decompressed.txt

Options

  • -i <file>: Input file path (required)
  • -o <file>: Output file path (required)
  • -d: Decompress mode (default is compress)
  • -p: Use parallel version (OpenMP multi-threading)

Environment Variables

  • OMP_NUM_THREADS: Set number of threads when using -p flag (default: system max)
    export OMP_NUM_THREADS=8  # Use 8 threads with -p flag
    ./huffcode -p -i large_file.txt -o compressed.huff

Testing

Run All Tests

# Run complete test suite (sequential, parallel, and pipeline tests)
make check

This runs:

  1. Sequential huffcode tests (basic compression/decompression)
  2. Parallel huffcode tests (multi-threaded compression)
  3. Pipeline component tests (18 tests covering BWT, MTF, RLE, Huffman, and integration)

Run Specific Test Suites

# Sequential tests only
make check_seq

# Parallel tests only
make check_parallel

# Pipeline tests only (BWT, MTF, RLE, Huffman integration)
make check_pipeline

# Or run pipeline tests directly
./test_pipeline

Test Coverage

The test_pipeline suite includes:

  • BWT Tests (4): Simple strings, repetitive patterns, edge cases
  • MTF Tests (3): Character patterns, binary data
  • RLE Tests (3): Run compression, no-run data, long runs
  • Pipeline Integration (1): Full BWT→MTF→RLE→Huffman pipeline
  • Huffman Tests (5): Small text, repetitive data, binary data, edge cases
  • Edge Cases (2): Two-symbol data, medium text (10KB)

Valgrind Memory Checks

# Run with Valgrind (checks for memory leaks)
make valgrind_check

Compression Pipeline Details

How It Works

  1. BWT (Burrows-Wheeler Transform)

    • Rearranges input data using lexicographic rotation sorting
    • Clusters similar characters together
    • Reversible with stored index position
    • Time: O(n² log n), Space: O(n)
  2. MTF (Move-to-Front)

    • Maintains a list of all 256 possible byte values
    • When a byte is encountered, output its position and move it to front
    • Repeated bytes get small position values
    • Time: O(n), Space: O(1)
  3. RLE (Run-Length Encoding)

    • Compresses runs of 3+ identical bytes
    • Stores original length in 4-byte header
    • Preserves incompressible data as-is
    • Time: O(n), Space: O(n)
  4. Huffman Coding

    • Builds frequency table and optimal prefix-free code tree
    • Encodes data using variable-length codes
    • Parallel implementation for frequency counting
    • Time: O(n log n), Space: O(n)

File Format

Compressed files contain:

  1. Original file size (4 bytes)
  2. BWT transform index (4 bytes)
  3. RLE-encoded data length (4 bytes)
  4. Huffman tree structure
  5. Compressed bit stream

Project Structure

.
├── huffcode.c              # Main command-line interface
├── huffman.c               # Huffman coding + pipeline integration
├── huffman_parallel.c      # Parallel Huffman implementation (OpenMP)
├── bwt.c / bwt.h          # Burrows-Wheeler Transform
├── mtf.c / mtf.h          # Move-to-Front encoding
├── rle.c / rle.h          # Run-Length Encoding
├── test_pipeline.c         # Comprehensive test suite (18 tests)
├── run_tests.sh            # Basic compression test script
├── Makefile                # Build system
└── test_data/              # Test files (1MB-100MB)

Algorithm Complexity

Algorithm   Encoding Time   Encoding Space   Decoding Time   Decoding Space
BWT         O(n² log n)     O(n)             O(n)            O(n)
MTF         O(n)            O(1)             O(n)            O(1)
RLE         O(n)            O(n)             O(n)            O(n)
Huffman     O(n log n)      O(n)             O(n)            O(n)
Total       O(n² log n)     O(n)             O(n)            O(n)

Note: BWT dominates encoding complexity; decoding is linear time.

Known Limitations

  • BWT encoding is O(n² log n) time complexity - slower on very large files
  • Random data compresses poorly (~75% ratio) due to high entropy
  • Files <1MB show minimal benefit from parallelization
  • Maximum file size limited by available memory (BWT requires ~2x file size in RAM)

Examples

Compress a Text File

$ ./huffcode -i document.txt -o document.huff
$ ls -lh document.*
-rw-r--r-- 1 user group  100K  document.txt
-rw-r--r-- 1 user group   45K  document.huff    # ~45% of original

Compress with 8 Threads

$ export OMP_NUM_THREADS=8
$ ./huffcode -p -i large_file.txt -o large_file.huff

Decompress and Verify

$ ./huffcode -d -i document.huff -o restored.txt
$ diff document.txt restored.txt
$ echo $?
0    # Files are identical!

Benchmark Performance

# Sequential compression
$ time ./huffcode -i test_data/random_10MB.txt -o /tmp/test.huff
real    0m0.269s
user    0m0.265s
sys     0m0.004s

# Parallel compression (8 threads)
$ export OMP_NUM_THREADS=8
$ time ./huffcode -p -i test_data/random_10MB.txt -o /tmp/test.huff
real    0m0.136s
user    0m0.856s
sys     0m0.008s

# Speedup: 1.97x faster!

License

See LICENSE file for details.

Contributing

Contributions welcome! Areas for improvement:

  • Faster BWT implementation (suffix array construction)
  • Block-based processing for better memory efficiency
  • Additional compression algorithms (LZ77, LZMA)
  • Improved parallel scaling for small files

References

  • Burrows, M. and Wheeler, D. J. (1994), A block-sorting lossless data compression algorithm
  • Huffman, D. A. (1952), A Method for the Construction of Minimum-Redundancy Codes
  • Bentley, J. L., et al. (1986), A locally adaptive data compression scheme
