A Systems-Level DNA Storage Archiver written in Rust.
Streaming I/O • AES-256-GCM • Reed-Solomon (N+K) • Viterbi Correction
Helix is a high-performance compiler designed to bridge the gap between binary data and biological storage. It transforms digital files into biostable DNA oligonucleotides formatted for synthesis and deep-time archival.
Unlike simple transcoders, Helix implements a full "Systems Storage" stack, handling encryption, error correction (Erasure + Viterbi), biological safety checks, and random access retrieval.
> [!IMPORTANT]
> **Experimental Research Prototype.** This project is a rigorous software implementation of DNA storage principles. While the encoding algorithms are designed to be biologically sound (GC-balanced, homopolymer-free, collision-resistant), this specific implementation has not yet been validated via wet-lab synthesis and sequencing.
>
> Use this tool for research, simulation, and algorithmic verification. Do not use it for critical long-term archival without physical validation of the primer sets and payload stability.
- Smart Streaming Architecture (a minimal sketch follows this feature list):
  * Constant Memory Footprint: Processes files in 4MB streaming chunks. This allows archiving multi-terabyte datasets with a minimal RAM footprint (~80MB peak), preventing OOM crashes even on constrained legacy hardware.
  * Memory-Aware Backpressure: The batch iterator monitors byte usage, not just line counts, ensuring "DNA Soup" files (massive single lines or many small lines) never exhaust physical RAM.
- Massively Parallel: Utilizes Rayon to parallelize CRC hashing, Reed-Solomon encoding, DNA translation, search filtering, and decay simulation across all available CPU cores (`-j` flag).
- Zstd Compression: Applies Zstandard (Level 3) compression before encoding to maximize the Bits-per-Molecule density.
- Homopolymer Prevention: Uses a Rotating Base-3 Trellis state machine, ensuring that no base is ever repeated (e.g., `AAAA` or `GGGG` is mathematically impossible), significantly reducing sequencing errors (see the encoder sketch after this list).
- Auto-Correction for Stability:
  * Salt & Retry Mechanism: If a block produces unstable DNA (bad GC content or $T_m$), the compiler automatically rotates the block's cryptographic salt and re-encodes. This changes the bitstream, and thus the DNA sequence, transparently until the biological constraints are met.
  * Synthesis Safety Guard: Analyzes every strand for GC-Content (40-60% window) and Melting Temperature ($T_m$).
- Fuzzy Primer Matching: The decoder employs Hamming distance checks (tolerance of 3 mismatches) to identify primers even when mutated. This prevents valid data from being discarded due to "Zip Code" rot.
- Primer Collision Avoidance: Scans payloads for accidental primer sequences and utilizes trellis chaining (FP -> Address -> Payload -> RP) to ensure seamless transitions.
- Cryptographic Access:
  * Argon2id for Master Key derivation (memory-hard).
  * HKDF + AES-GCM for per-block session keys. A unique nonce and salt for every block means identical files produce completely different DNA streams.
- Multi-Layer Error Correction:
  * Reed-Solomon (Erasure Coding): Configurable redundancy (default: 10 data + 5 parity shards) recovers files even if 33% of strands are completely lost (any 5 of the 15 shards per block).
  * Viterbi Decoder (Mutation Correction): Treats DNA as a "Noisy Channel." If a strand fails integrity checks, the Viterbi engine finds the optimal path through the trellis to "heal" substitution errors, recovering data from strands with ~1.0% mutation rates.
  * Chemical Corruption Detection: A CRC32 checksum is prepended to every shard to validate the final output of the Viterbi decode.
- In-Silico PCR (Streaming Search): Supports memory-safe "Soft-Search" by filtering gigabytes of mixed DNA data ("The Soup") for specific primer tags using a parallelized, streaming map-reduce approach.
- Configurable Primers: Users can define custom Forward/Reverse primers to physically address specific files within a biological pool.
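The byte-aware batching described above can be pictured with a minimal, std-only sketch. This is not Helix's actual code: the function name `for_each_batch` and the 4 MB budget are illustrative, but the idea is the same, batches are flushed once they hold roughly a fixed number of bytes, so neither one enormous line nor millions of tiny ones can exhaust RAM.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

/// Flush a batch once it holds roughly `budget` bytes, regardless of whether
/// that took one huge line or thousands of tiny ones, so peak memory stays
/// near a single batch.
fn for_each_batch(
    path: &str,
    budget: usize,
    mut handle: impl FnMut(Vec<String>),
) -> std::io::Result<()> {
    let reader = BufReader::new(File::open(path)?);
    let mut batch = Vec::new();
    let mut used = 0usize;
    for line in reader.lines() {
        let line = line?;
        used += line.len();
        batch.push(line);
        if used >= budget {
            handle(std::mem::take(&mut batch)); // hand off, e.g. to a worker pool
            used = 0;
        }
    }
    if !batch.is_empty() {
        handle(batch);
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // 4 MB budget, mirroring the block size used throughout this README.
    for_each_batch("soup.fasta", 4 * 1024 * 1024, |lines| {
        println!("processing a batch of {} lines", lines.len());
    })
}
```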
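The homopolymer guarantee can likewise be sketched. The example below is a deliberately simplified stand-in for Helix's rotating base-3 trellis, not the real transcoder (which is denser and chained through the primer and address regions): each byte is expanded into six base-3 digits, and every digit selects one of the three bases that differ from the previously emitted base, so a repeated base can never occur. `encode_homopolymer_free` is a hypothetical helper.

```rust
const BASES: [char; 4] = ['A', 'C', 'G', 'T'];

/// Encode one byte as 6 base-3 digits (3^6 = 729 >= 256), then map each digit
/// onto one of the three bases that differ from the previously emitted base.
/// Because the previous base is always excluded, homopolymers (AA, GGG, ...)
/// cannot occur by construction.
fn encode_homopolymer_free(data: &[u8]) -> String {
    let mut out = String::with_capacity(data.len() * 6);
    let mut prev: Option<char> = None;
    for &byte in data {
        // Most-significant trit first.
        let mut trits = [0u8; 6];
        let mut v = byte as u16;
        for t in trits.iter_mut().rev() {
            *t = (v % 3) as u8;
            v /= 3;
        }
        for trit in trits {
            // The candidate alphabet is all bases except the previous one.
            let candidates: Vec<char> =
                BASES.iter().copied().filter(|&b| Some(b) != prev).collect();
            let base = candidates[trit as usize];
            out.push(base);
            prev = Some(base);
        }
    }
    out
}

fn main() {
    let dna = encode_homopolymer_free(b"HELIX");
    assert!(!dna.contains("AA") && !dna.contains("CC") && !dna.contains("GG") && !dna.contains("TT"));
    println!("{dna}");
}
```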
The Helix Pipeline operates on 4MB independent blocks, transforming binary data through 5 distinct layers:
- L1 - Stream & Compress: The file is read in buffered 4MB chunks and compressed via Zstd.
- L2 - Encryption: The compressed chunk is encrypted (AES-256-GCM) using a unique nonce and salt per block. Note: If stability checks fail, this step is re-run with a new salt.
- L3 - Redundancy: The blob is split into $N$ data shards; $K$ parity shards are generated using Galois Field arithmetic (Reed-Solomon).
- L4 - Transcoding:
  * Each shard is prepended with a CRC32 checksum.
  * Binary data is mapped to DNA bases using the constrained trellis.
  * Primers and Index Addresses are attached: `[FwdPrimer] [Address] [Payload] [RevPrimer]`.
- L5 - Analysis: The resulting Oligo is checked for biological stability metrics (GC% and $T_m$), sketched below.
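To make the L5 analysis step concrete, here is a hedged sketch of a stability guard in the spirit of the Synthesis Safety Guard above. The GC window matches the 40-60% rule from the feature list; the melting-temperature estimate uses one common long-oligo approximation ($T_m \approx 64.9 + 41\,(GC - 16.4)/N$), which may differ from the formula Helix actually implements, and the acceptance window in `main` is purely illustrative.

```rust
/// Reject strands whose GC fraction falls outside the 40-60% window or whose
/// estimated melting temperature is outside `tm_window`.
/// Tm here uses a common long-oligo approximation: 64.9 + 41 * (GC - 16.4) / N.
fn passes_stability_guard(strand: &str, tm_window: (f64, f64)) -> bool {
    if strand.is_empty() {
        return false;
    }
    let n = strand.len() as f64;
    let gc = strand.chars().filter(|c| matches!(c, 'G' | 'C')).count() as f64;
    let gc_fraction = gc / n;
    let tm = 64.9 + 41.0 * (gc - 16.4) / n;
    (0.40..=0.60).contains(&gc_fraction) && tm >= tm_window.0 && tm <= tm_window.1
}

fn main() {
    let oligo = "GCTAGCTAGCTAGCTAGCTAATCGTACGATCGTAGCTAGC";
    // Hypothetical acceptance window; Helix's thresholds may differ.
    println!("stable: {}", passes_stability_guard(oligo, (50.0, 70.0)));
}
```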
# Clone the repository
git clone https://github.com/SSL-ACTX/helix.git
cd helix
# Build optimized binary
cargo build --release
# Or install directly
cargo install --git https://github.com/SSL-ACTX/helix.git

Encrypts, compresses, and encodes a file into a DNA stream.
# Standard encoding (Auto-threading)
./target/release/helix compile database.dump --output archive.fasta
# High-Security Mode (Custom Password & High Redundancy)
./target/release/helix compile secrets.pdf \
--password "hunter2" \
--data 20 --parity 10
# Custom Primers (for physical PCR addressing)
./target/release/helix compile project.zip \
--primer-fwd "GCTAGCTAGCTAGCTAGCTA" \
--primer-rev "CGATCGATCGATCGATCGAT"
Extracts specific strands from a massive DNA dataset based on tags or primers. Now safe for files larger than RAM.
# Search by Tag
./target/release/helix search soup.fasta "project_alpha" --output found.fasta
# Search by Custom Primer
./target/release/helix search soup.fasta \
--primer-fwd "GCTAGCTAGCTAGCTAGCTA" \
--primer-rev "CGATCGATCGATCGATCGAT" \
--output found.fasta
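The fuzzy primer matching behind `search` (and the decoder's "Zip Code" recovery) boils down to a Hamming-distance test with a tolerance of 3 mismatches. The sketch below is not Helix's implementation: it assumes primers sit exactly at the start and end of a strand, which simplifies the real layout, and `has_primer` plus the example strand are illustrative only.

```rust
/// Number of positions at which two equal-length sequences differ.
fn hamming(a: &str, b: &str) -> usize {
    a.bytes().zip(b.bytes()).filter(|(x, y)| x != y).count()
}

/// A strand "carries" the primer pair if its head and tail are each within
/// `tolerance` substitutions of the expected primer (3 in the README).
fn has_primer(strand: &str, fwd: &str, rev: &str, tolerance: usize) -> bool {
    if strand.len() < fwd.len() + rev.len() {
        return false;
    }
    let head = &strand[..fwd.len()];
    let tail = &strand[strand.len() - rev.len()..];
    hamming(head, fwd) <= tolerance && hamming(tail, rev) <= tolerance
}

fn main() {
    let fwd = "GCTAGCTAGCTAGCTAGCTA";
    let rev = "CGATCGATCGATCGATCGAT";
    // Mutated copy of the forward primer (2 substitutions) still matches.
    let strand = format!("GCTAGCTAGCTAGATAGCTT{}{}", "ACGTACGTACGTACGT", rev);
    println!("match: {}", has_primer(&strand, fwd, rev, 3));
}
```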
Recovers the binary file from a DNA stream. Supports out-of-order recovery and streaming writes.
./target/release/helix restore archive.fasta recovered.file \
--password "hunter2" \
--data 20 --parity 10
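Out-of-order recovery can be pictured with a small sketch that is not taken from Helix's source: decoded shards are slotted into their block by (block, shard) index, and a block is handed on to the Reed-Solomon and decryption stages as soon as enough shards have arrived, letting the writer stream blocks to disk instead of buffering the whole file. The `Reassembler` type and the `DATA`/`PARITY` constants (mirroring the 10 + 5 default) are assumptions for illustration.

```rust
use std::collections::BTreeMap;

const DATA: usize = 10;   // default data shards per block (README default)
const PARITY: usize = 5;  // default parity shards per block

struct Reassembler {
    pending: BTreeMap<u32, Vec<Option<Vec<u8>>>>, // block index -> shard slots
}

impl Reassembler {
    fn new() -> Self {
        Self { pending: BTreeMap::new() }
    }

    /// Insert one decoded shard; returns the block's shard slots once at least
    /// DATA of the DATA+PARITY slots are filled (enough for erasure recovery).
    fn insert(&mut self, block: u32, shard: usize, bytes: Vec<u8>) -> Option<Vec<Option<Vec<u8>>>> {
        let slots = self
            .pending
            .entry(block)
            .or_insert_with(|| vec![None; DATA + PARITY]);
        slots[shard] = Some(bytes);
        let filled = slots.iter().filter(|s| s.is_some()).count();
        if filled >= DATA {
            return self.pending.remove(&block);
        }
        None
    }
}

fn main() {
    let mut r = Reassembler::new();
    // Shards for block 0 arriving out of order; the block completes on the 10th.
    for shard in [7, 3, 0, 9, 1, 12, 5, 2, 8, 4] {
        if let Some(block) = r.insert(0, shard, vec![0u8; 32]) {
            println!("block 0 ready with {} shards present", block.iter().flatten().count());
        }
    }
}
```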
Simulates "Deep Time" storage by randomly deleting strands (dropout) and introducing bit-rot (mutation) to test robustness.
# Simulate 10,000 years of decay (30% dropout + 0.5% mutation rate)
./target/release/helix simulate archive.fasta \
--dropout 30 \
--mutation 0.005 \
--output decayed.fasta
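Conceptually, the simulator applies two independent error processes, roughly as in this dependency-free sketch (not the actual implementation): each strand is dropped with probability `dropout`, and each base of a surviving strand is substituted with probability `mutation`. The tiny xorshift generator and the `decay` function are illustrative; a real simulator would use a proper RNG crate and would force a substitution to change the base.

```rust
/// Minimal xorshift PRNG so the sketch stays dependency-free.
struct XorShift(u64);

impl XorShift {
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Drop each strand with probability `dropout`; substitute each base of the
/// survivors with probability `mutation` (the substitute may equal the
/// original here, a simplification).
fn decay(strands: &[String], dropout: f64, mutation: f64, rng: &mut XorShift) -> Vec<String> {
    const BASES: [char; 4] = ['A', 'C', 'G', 'T'];
    let mut out = Vec::new();
    for s in strands {
        if rng.next_f64() < dropout {
            continue; // strand lost entirely (dropout)
        }
        let mut mutated = String::with_capacity(s.len());
        for base in s.chars() {
            if rng.next_f64() < mutation {
                mutated.push(BASES[(rng.next_f64() * 4.0) as usize % 4]); // bit-rot
            } else {
                mutated.push(base);
            }
        }
        out.push(mutated);
    }
    out
}

fn main() {
    let mut rng = XorShift(0xDEC0DE);
    let strands: Vec<String> = (0..1000).map(|_| "ACGTACGTACGTACGTACGT".to_string()).collect();
    // Mirrors the CLI example above: 30% dropout, 0.5% per-base mutation.
    let decayed = decay(&strands, 0.30, 0.005, &mut rng);
    println!("{} of {} strands survived", decayed.len(), strands.len());
}
```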
Helix includes a rigorous Python validation suite (full_test.py) that tests the entire stack against edge cases:
- Concurrency Interop: Verifies thread safety between sequential and parallel modes.
- Cryptographic Denial: Ensures wrong passwords yield fatal errors.
- Catastrophic Data Loss: Tests recovery limits (strand losses beyond the parity limit).
- Bit-Rot/Mutation: Verifies CRC32 detection of mutated bases using the internal mutation simulator.
- Viterbi Repair: Validates the dynamic programming engine against heavy mutation scenarios (1.0% error rate).
- Stability Enforcement: Stresses the "Salt & Retry" engine with pathological binary inputs.
- Primer Safety: Fuzzing tests to ensure no accidental primer collisions occur in the payload.
- Streaming Stress: Validates multi-block processing with files > RAM.
To run the full suite:
python tests/full_test.py
**Built with 🦀 and ☕ by Seuriin**