Streaming content hashing — adds SHA-256 or BLAKE3 byte identity to every artifact in a manifest, enabling deduplication, caching, and immutability verification.
No AI. No inference. Pure deterministic hashing and serialization.
```
brew install cmdrvl/tap/hash
```

The Problem: After scanning a directory, you have a manifest of files — but no proof of what's inside them. Without byte-level identity, you can't detect changes, deduplicate, or verify integrity. Manual `shasum` loops are fragile, unstructured, and don't compose.
The Solution: One streaming command that reads a JSONL manifest and adds a cryptographic hash (bytes_hash) to every record. Parallel by default, constant memory, deterministic output order. Same inputs always produce the same hashes.
| Feature | What It Does |
|---|---|
| Streaming | Constant memory — hashes files through a 64 KB buffer, never loads entire files |
| Parallel | --jobs N workers (default: CPU count) with deterministic output ordering |
| Two algorithms | SHA-256 (default, widely verified) or BLAKE3 (faster, modern) |
| Skipped tracking | Unreadable files are captured with warnings, not silently dropped |
| Pipeline native | Reads vacuum JSONL, emits enriched JSONL for fingerprint and lock |
| Deterministic | Same manifest + same files = identical output regardless of --jobs |
| Audit trail | Every run recorded in the ambient witness ledger |
```
$ vacuum /data/dec | hash
{"version":"hash.v0","path":"/data/dec/model.xlsx","relative_path":"model.xlsx","root":"/data/dec","size":2481920,"mtime":"2025-12-31T12:00:00.000Z","extension":".xlsx","mime_guess":"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet","bytes_hash":"sha256:e3b0c44298fc1c149afbf4c8996fb924","hash_algorithm":"sha256","tool_versions":{"vacuum":"0.1.0","hash":"0.1.0"}}
{"version":"hash.v0","path":"/data/dec/tape.csv","relative_path":"tape.csv","root":"/data/dec","size":847201,"mtime":"2025-12-15T08:30:00.000Z","extension":".csv","mime_guess":"text/csv","bytes_hash":"sha256:7d865e959b2466918c9863afca942d0f","hash_algorithm":"sha256","tool_versions":{"vacuum":"0.1.0","hash":"0.1.0"}}
```

Two files hashed — SHA-256 by default, tool versions accumulated, ready for fingerprint or lock.
```
# Use BLAKE3 for faster hashing:
$ vacuum /data/dec | hash --algorithm blake3

# Sequential processing (deterministic debugging):
$ vacuum /data/dec | hash --jobs 1

# Full pipeline into lockfile:
$ vacuum /data/dec | hash | lock --dataset-id "dec" > dec.lock.json

# With fingerprinting:
$ vacuum /data/models | hash | fingerprint --fp argus-model.v1 \
    | lock --dataset-id "models" > models.lock.json
```

hash is the second tool in the stream pipeline — it establishes byte identity.
```
vacuum → hash → fingerprint → lock → pack
(scan)   (hash)  (template)   (pin)   (seal)
```
Each tool reads JSONL from stdin and emits enriched JSONL to stdout. hash receives records from vacuum and adds bytes_hash — the cryptographic content identity that downstream tools depend on.
hash does not replace other pipeline tools.
| If you need... | Use |
|---|---|
| Enumerate files in a directory | vacuum |
| Match files against template definitions | fingerprint |
| Pin artifacts into a self-hashed lockfile | lock |
| Check structural comparability of CSVs | shape |
| Explain numeric changes between CSVs | rvl |
| Bundle into immutable evidence packs | pack |
hash only answers: what is the exact byte identity of each artifact?
hash emits exactly one domain outcome.
ALL_HASHED (exit 0): every input record was successfully hashed. No skipped records in the output.

```
$ vacuum /data/dec | hash
# exit 0 — all files hashed
```

PARTIAL (exit 1): at least one record has `_skipped: true` — either passed through from upstream or newly skipped because hash couldn't read the file. Remaining records are hashed normally. The output is valid but incomplete.

```
$ vacuum /data/dec | hash
# exit 1 — some files couldn't be hashed
# check: jq 'select(._skipped == true)' to see which
```

REFUSAL (exit 2): the input stream is invalid — not valid JSONL, missing required fields, or an I/O error on stdin/stdout.
```
{
  "code": "E_BAD_INPUT",
  "message": "Invalid JSONL on line 3",
  "detail": { "line": 3, "error": "expected value at line 1 column 1" },
  "next_command": null
}
```

| Capability | hash | shasum / sha256sum | b3sum | Custom script |
|---|---|---|---|---|
| Streaming JSONL pipeline | Yes | No | No | You write it |
| Parallel with deterministic order | Yes | No | Yes | You write it |
| SHA-256 + BLAKE3 in one tool | Yes | SHA only | BLAKE3 only | You write it |
| Skipped file tracking | Yes (with warnings) | Fails | Fails | You write it |
| Upstream field passthrough | Yes | No | No | You write it |
| Tool version accumulation | Yes | No | No | No |
| Audit trail (witness ledger) | Yes | No | No | No |
| Constant memory (streaming) | Yes | Yes | Yes | Depends |
When to use hash:
- Middle of the epistemic pipeline — add byte identity between `vacuum` and `lock`
- Deduplication and caching — detect identical files by hash
- Integrity verification — prove file contents haven't changed

When hash might not be ideal:
- You just need a quick hash of one file — use `shasum` or `b3sum` directly
- You need content-aware hashing (e.g., ignoring whitespace) — use `fingerprint` content hashes
- You need hash trees or verified streaming — deferred in v0
```
brew install cmdrvl/tap/hash
```

```
curl -fsSL https://raw.githubusercontent.com/cmdrvl/hash/main/scripts/install.sh | bash
```

```
cargo build --release
./target/release/hash --help
```

```
hash [<INPUT>] [OPTIONS]
hash witness <query|last|count> [OPTIONS]
```

`[INPUT]`: JSONL manifest file. Defaults to stdin.
| Flag | Type | Default | Description |
|---|---|---|---|
| `--algorithm <ALG>` | string | `sha256` | Hash algorithm: `sha256` or `blake3` (case-insensitive) |
| `--jobs <N>` | integer | CPU count | Parallel workers; `--jobs 1` for sequential |
| `--no-witness` | flag | `false` | Suppress witness ledger recording |
| `--describe` | flag | `false` | Print compiled `operator.json` to stdout, exit 0 |
| `--schema` | flag | `false` | Print JSON Schema to stdout, exit 0 |
| `--progress` | flag | `false` | Emit structured progress JSONL to stderr |
| `--version` | flag | `false` | Print `hash <semver>` to stdout, exit 0 |
| Code | Meaning |
|---|---|
| `0` | `ALL_HASHED` (every record hashed successfully) |
| `1` | `PARTIAL` (some records skipped) |
| `2` | `REFUSAL` or CLI error |

stdout: enriched JSONL records (one per input record)
stderr: progress diagnostics (with `--progress`) or warnings
JSONL records from vacuum (or any tool producing `vacuum.v0` records). Required fields:

- `path` — absolute file path (used to open and read the file)
- `version` — upstream record version
Each input record is enriched with hash fields:
```
{
  "version": "hash.v0",
  "path": "/data/dec/tape.csv",
  "relative_path": "tape.csv",
  "root": "/data/dec",
  "size": 847201,
  "mtime": "2025-12-15T08:30:00.000Z",
  "extension": ".csv",
  "mime_guess": "text/csv",
  "bytes_hash": "sha256:7d865e959b2466918c9863afca942d0fb7903eb3",
  "hash_algorithm": "sha256",
  "tool_versions": { "vacuum": "0.1.0", "hash": "0.1.0" }
}
```

| Added Field | Type | Description |
|---|---|---|
| `bytes_hash` | string | `<algorithm>:<lowercase-hex>` (`null` if `_skipped`) |
| `hash_algorithm` | string | `"sha256"` or `"blake3"` (`null` if `_skipped`) |
All upstream fields are passed through. version is updated to "hash.v0". tool_versions is merged with hash added.
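The enrichment rules above (passthrough, version bump, `tool_versions` merge) can be sketched in Python. This is an illustration of the documented record transformation, not the tool's implementation; `TOOL_VERSION` is a hypothetical constant.

```python
import hashlib

TOOL_VERSION = "0.1.0"  # hypothetical version constant for this sketch

def enrich(record: dict, algorithm: str = "sha256") -> dict:
    """Add bytes_hash/hash_algorithm, bump version, merge tool_versions."""
    h = hashlib.new(algorithm)  # "blake3" would need the third-party blake3 package
    with open(record["path"], "rb") as f:
        while chunk := f.read(64 * 1024):
            h.update(chunk)
    out = dict(record)  # all upstream fields pass through unchanged
    out["bytes_hash"] = f"{algorithm}:{h.hexdigest()}"  # <algorithm>:<lowercase-hex>
    out["hash_algorithm"] = algorithm
    out["version"] = "hash.v0"
    out["tool_versions"] = {**record.get("tool_versions", {}), "hash": TOOL_VERSION}
    return out
```

Note that the merge keeps upstream entries (e.g. `vacuum`) and only adds the `hash` key, so version provenance accumulates along the pipeline.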
- SHA-256: `sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855` (64 hex chars)
- BLAKE3: `blake3:af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262` (64 hex chars)
- Upstream `_skipped`: passed through unchanged — hash does NOT attempt to hash, does NOT modify `_warnings`, DOES update `version` and `tool_versions`
- New skip: if hash can't read a file, it marks `_skipped: true` and appends a warning:
```
{
  "_skipped": true,
  "_warnings": [
    { "tool": "hash", "code": "E_FILE_READ", "message": "Cannot read file", "detail": { "path": "/data/dec/locked.xlsx", "error": "Permission denied" } }
  ]
}
```

| Code | Trigger | Next Step |
|---|---|---|
| `E_BAD_INPUT` | Not valid JSONL or missing required fields | Check upstream output (run `vacuum` first) |
| `E_IO` | Cannot read input/output stream | Check stdin/stdout and filesystem |
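The skip-on-read-failure behavior described above can be sketched in a few lines. This is an illustrative Python sketch, not the tool's actual (Rust) implementation; the warning shape follows the JSON example shown earlier.

```python
import hashlib

def hash_or_skip(record: dict) -> dict:
    """Hash the file at record['path']; on read failure, mark the record skipped."""
    out = dict(record)
    try:
        h = hashlib.sha256()
        with open(record["path"], "rb") as f:
            while chunk := f.read(64 * 1024):
                h.update(chunk)
        out["bytes_hash"] = "sha256:" + h.hexdigest()
        out["hash_algorithm"] = "sha256"
    except OSError as e:
        # Unreadable files are captured with a warning, not silently dropped
        out["_skipped"] = True
        out.setdefault("_warnings", []).append({
            "tool": "hash", "code": "E_FILE_READ",
            "message": "Cannot read file",
            "detail": {"path": record["path"], "error": str(e)},
        })
        out["bytes_hash"] = None
        out["hash_algorithm"] = None
    return out
```

Note that a read failure produces a valid, enriched record with `bytes_hash: null` rather than aborting the whole run; only an invalid input stream is a refusal.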
You're piping something that isn't valid JSONL. Most common cause: piping raw file paths instead of vacuum output.
```
# Wrong:
echo "/data/tape.csv" | hash

# Right:
vacuum /data | hash
```

hash couldn't read the file (permission denied, or the file was deleted between vacuum and hash). Check the `_warnings` array:

```
vacuum /data | hash | jq 'select(._skipped == true) | {path, _warnings}'
```

This should not happen — hash guarantees deterministic output order regardless of parallelism. If you see different hashes, the files may have changed between runs.
SHA-256 is the default and most widely verified. BLAKE3 is faster (especially on large files) but produces different hashes. Choose one and stick with it — lock records hash_algorithm so downstream tools know which was used.
Parallel workers have overhead per file. For many small files, try --jobs 1 to reduce scheduling overhead. For large files, more --jobs helps.
| Limitation | Detail |
|---|---|
| Whole-file hashing only | No range hashing or content-aware hashing — use fingerprint content hashes for that |
| No hash trees | No Merkle tree or bao verified streaming — deferred in v0 |
| No XXH3 | Non-cryptographic fast hashing not available — deferred in v0 |
| In-memory record buffering | Output ordering requires buffering; not true streaming for very large manifests |
| No hash verification | hash computes hashes — it doesn't verify stored hashes against current files |
| No dedup output | hash reports hashes but doesn't flag duplicates — do that downstream |
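Since hash reports hashes but does not flag duplicates, a downstream dedup pass can group records by `bytes_hash`. A minimal Python sketch over the enriched JSONL (illustrative; any JSONL consumer would work):

```python
import json
from collections import defaultdict

def find_duplicates(jsonl_lines):
    """Group record paths by bytes_hash; return only hashes seen more than once."""
    groups = defaultdict(list)
    for line in jsonl_lines:
        rec = json.loads(line)
        if rec.get("bytes_hash"):  # skipped records carry a null hash; ignore them
            groups[rec["bytes_hash"]].append(rec["path"])
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

Because identical bytes always produce identical `bytes_hash` values, equality of hashes is a reliable duplicate signal within one algorithm.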
shasum produces unstructured text (`<hash>  <filename>` lines). hash produces structured JSONL that preserves all upstream fields, accumulates tool versions, tracks skipped files, and pipes directly into fingerprint and lock. It's also parallel by default.
SHA-256 is ubiquitous — auditors, compliance teams, and most tooling expect it. BLAKE3 is available via --algorithm blake3 when speed matters more than compatibility.
No. hash uses a 64 KB streaming buffer — memory usage is constant regardless of file size. A 100 GB file uses the same memory as a 1 KB file.
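The constant-memory claim follows from hashing through a fixed-size buffer. A minimal Python sketch of the idea (the tool itself is not Python; the 64 KB chunk size matches the documented buffer):

```python
import hashlib

CHUNK = 64 * 1024  # fixed 64 KB buffer; memory use is independent of file size

def stream_sha256(path: str) -> str:
    """Hash a file incrementally without ever loading it whole."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            h.update(chunk)
    return "sha256:" + h.hexdigest()
```

The incremental `update()` calls produce exactly the same digest as hashing the whole file at once, which is what makes streaming safe.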
Determinism. Each record gets a sequence number. Workers hash in parallel, but the output thread reorders results by sequence number before emitting. Same input always produces the same output.
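The sequence-number scheme can be sketched as follows (illustrative Python, not the tool's Rust internals): work runs concurrently, but results are emitted strictly in input order.

```python
from concurrent.futures import ThreadPoolExecutor

def hash_manifest(items, worker, jobs=4):
    """Run `worker` over items in parallel, yielding results in input order."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Submission order acts as the sequence number; iterating the
        # futures list in that same order restores deterministic output.
        futures = [pool.submit(worker, item) for item in items]
        for fut in futures:
            yield fut.result()
```

Whichever worker finishes first, the consumer blocks until the next-in-sequence result is ready, so the output stream is identical for any `jobs` value.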
Yes — any JSONL with a path field works. But vacuum provides the standard record format that downstream tools expect.
hash passes them through without attempting to hash. It updates version to hash.v0 and merges tool_versions, but leaves _skipped, _warnings, and other fields untouched.
Yes. SHA-256 of zero bytes is always sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855. BLAKE3 of zero bytes is always blake3:af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262.
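The SHA-256 constant above is the standard empty-input digest and is easy to confirm with the standard library (the BLAKE3 value requires the third-party `blake3` package, so only SHA-256 is checked here):

```python
import hashlib

EMPTY_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

# SHA-256 of zero bytes is a fixed, well-known value.
assert hashlib.sha256(b"").hexdigest() == EMPTY_SHA256
print("sha256:" + hashlib.sha256(b"").hexdigest())
```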
```
$ hash --describe | jq '.exit_codes'
{
  "0": { "meaning": "ALL_HASHED" },
  "1": { "meaning": "PARTIAL" },
  "2": { "meaning": "REFUSAL" }
}
```
```
$ hash --describe | jq '.pipeline'
{
  "upstream": ["vacuum"],
  "downstream": ["fingerprint", "lock", "pack"]
}
```

```
# 1. Hash all artifacts
vacuum /data/dec | hash > hashed.jsonl
case $? in
  0) echo "all hashed" ;;
  1) echo "partial — some files couldn't be hashed"
     jq -s '[.[] | select(._skipped == true)] | length' hashed.jsonl ;;
  2) echo "refusal"
     cat hashed.jsonl | jq '.code'
     exit 1 ;;
esac

# 2. Continue pipeline
cat hashed.jsonl | lock --dataset-id "dec" > dec.lock.json
```

- Exit codes — `0/1/2` map to complete/partial/error branching
- Structured JSONL only — stdout is always machine-readable
- `--describe` — prints `operator.json` so an agent discovers the tool without reading docs
- `--schema` — prints the record JSON schema for programmatic validation
- Deterministic — same input always produces the same output, enabling reliable caching
Witness Subcommands
hash records every run to an ambient witness ledger. You can query this ledger:
```
# Query by date range or outcome
hash witness query --tool hash --since 2026-01-01 --outcome ALL_HASHED --json

# Get the most recent run
hash witness last --json

# Count runs matching a filter
hash witness count --since 2026-02-01
```

```
hash witness query [--tool <name>] [--since <iso8601>] [--until <iso8601>] \
    [--outcome <ALL_HASHED|PARTIAL|REFUSAL>] [--input-hash <substring>] \
    [--limit <n>] [--json]
hash witness last [--json]
hash witness count [--tool <name>] [--since <iso8601>] [--until <iso8601>] \
    [--outcome <ALL_HASHED|PARTIAL|REFUSAL>] [--input-hash <substring>] [--json]
```

| Code | Meaning |
|---|---|
| `0` | One or more matching records returned |
| `1` | No matches (or empty ledger for `last`) |
| `2` | CLI parse error or witness internal error |
- Default: `~/.epistemic/witness.jsonl`
- Override: set the `EPISTEMIC_WITNESS` environment variable
- Malformed ledger lines are skipped; valid lines continue to be processed
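Because the ledger is plain JSONL, the query behavior (including skipping malformed lines) is straightforward to approximate. A hedged Python sketch, assuming ledger records carry `outcome` and `timestamp` fields, which this README does not spell out:

```python
import json

def query_ledger(lines, outcome=None, since=None):
    """Yield ledger records matching the filters; malformed lines are skipped."""
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed ledger lines are skipped, per the ledger rules
        if outcome and rec.get("outcome") != outcome:
            continue
        if since and rec.get("timestamp", "") < since:
            continue  # ISO 8601 timestamps compare correctly as strings
        yield rec
```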
The full specification is in `docs/PLAN.md`. This README covers intended v0 behavior; the spec adds implementation details, edge-case definitions, and testing requirements.
```
cargo fmt --check
cargo clippy --all-targets -- -D warnings
cargo test
```