
hash

CI · License: MIT · GitHub release

Streaming content hashing — adds SHA-256 or BLAKE3 byte identity to every artifact in a manifest, enabling deduplication, caching, and immutability verification.

No AI. No inference. Pure deterministic hashing and serialization.

brew install cmdrvl/tap/hash

TL;DR

The Problem: After scanning a directory, you have a manifest of files — but no proof of what's inside them. Without byte-level identity, you can't detect changes, deduplicate, or verify integrity. Manual shasum loops are fragile, unstructured, and don't compose.

The Solution: One streaming command that reads a JSONL manifest and adds a cryptographic hash (bytes_hash) to every record. Parallel by default, constant memory, deterministic output order. Same inputs always produce the same hashes.

Why Use hash?

| Feature | What It Does |
| --- | --- |
| Streaming | Constant memory — hashes files through a 64 KB buffer, never loads entire files |
| Parallel | --jobs N workers (default: CPU count) with deterministic output ordering |
| Two algorithms | SHA-256 (default, widely verified) or BLAKE3 (faster, modern) |
| Skipped tracking | Unreadable files are captured with warnings, not silently dropped |
| Pipeline native | Reads vacuum JSONL, emits enriched JSONL for fingerprint and lock |
| Deterministic | Same manifest + same files = identical output regardless of --jobs |
| Audit trail | Every run recorded in the ambient witness ledger |

Quick Example

$ vacuum /data/dec | hash
{"version":"hash.v0","path":"/data/dec/model.xlsx","relative_path":"model.xlsx","root":"/data/dec","size":2481920,"mtime":"2025-12-31T12:00:00.000Z","extension":".xlsx","mime_guess":"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet","bytes_hash":"sha256:e3b0c44298fc1c149afbf4c8996fb924","hash_algorithm":"sha256","tool_versions":{"vacuum":"0.1.0","hash":"0.1.0"}}
{"version":"hash.v0","path":"/data/dec/tape.csv","relative_path":"tape.csv","root":"/data/dec","size":847201,"mtime":"2025-12-15T08:30:00.000Z","extension":".csv","mime_guess":"text/csv","bytes_hash":"sha256:7d865e959b2466918c9863afca942d0f","hash_algorithm":"sha256","tool_versions":{"vacuum":"0.1.0","hash":"0.1.0"}}

Two files hashed — SHA-256 by default, tool versions accumulated, ready for fingerprint or lock.

# Use BLAKE3 for faster hashing:
$ vacuum /data/dec | hash --algorithm blake3

# Sequential processing (deterministic debugging):
$ vacuum /data/dec | hash --jobs 1

# Full pipeline into lockfile:
$ vacuum /data/dec | hash | lock --dataset-id "dec" > dec.lock.json

# With fingerprinting:
$ vacuum /data/models | hash | fingerprint --fp argus-model.v1 \
    | lock --dataset-id "models" > models.lock.json

Where hash Fits

hash is the second tool in the stream pipeline — it establishes byte identity.

vacuum  →  hash  →  fingerprint  →  lock  →  pack
(scan)    (hash)    (template)     (pin)    (seal)

Each tool reads JSONL from stdin and emits enriched JSONL to stdout. hash receives records from vacuum and adds bytes_hash — the cryptographic content identity that downstream tools depend on.


What hash Is Not

hash does not replace other pipeline tools.

| If you need... | Use |
| --- | --- |
| Enumerate files in a directory | vacuum |
| Match files against template definitions | fingerprint |
| Pin artifacts into a self-hashed lockfile | lock |
| Check structural comparability of CSVs | shape |
| Explain numeric changes between CSVs | rvl |
| Bundle into immutable evidence packs | pack |

hash only answers: what is the exact byte identity of each artifact?


The Three Outcomes

Every run ends in exactly one of three domain outcomes.

1. ALL_HASHED (exit 0)

Every input record was successfully hashed. No skipped records in the output.

$ vacuum /data/dec | hash
# exit 0 — all files hashed

2. PARTIAL (exit 1)

At least one record has _skipped: true — either passed through from upstream or newly skipped because hash couldn't read the file. Remaining records are hashed normally. The output is valid but incomplete.

$ vacuum /data/dec | hash
# exit 1 — some files couldn't be hashed
# check: jq 'select(._skipped == true)' to see which

3. REFUSAL (exit 2)

Input stream is invalid — not valid JSONL, missing required fields, or I/O error on stdin/stdout.

{
  "code": "E_BAD_INPUT",
  "message": "Invalid JSONL on line 3",
  "detail": { "line": 3, "error": "expected value at line 1 column 1" },
  "next_command": null
}

How hash Compares

| Capability | hash | shasum / sha256sum | b3sum | Custom script |
| --- | --- | --- | --- | --- |
| Streaming JSONL pipeline | Yes | No | No | You write it |
| Parallel with deterministic order | Yes | No | Yes | You write it |
| SHA-256 + BLAKE3 in one tool | Yes | SHA only | BLAKE3 only | You write it |
| Skipped file tracking | Yes (with warnings) | Fails | Fails | You write it |
| Upstream field passthrough | Yes | No | No | You write it |
| Tool version accumulation | Yes | No | No | No |
| Audit trail (witness ledger) | Yes | No | No | No |
| Constant memory (streaming) | Yes | Yes | Yes | Depends |

When to use hash:

  • Middle of the epistemic pipeline — add byte identity between vacuum and lock
  • Deduplication and caching — detect identical files by hash (see the jq sketch after this list)
  • Integrity verification — prove file contents haven't changed
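
A minimal dedup sketch with jq (the directory is illustrative; assumes jq is installed): group hashed records by bytes_hash and keep any group with more than one path.

vacuum /data | hash \
  | jq -s '[.[] | select(._skipped != true)] | group_by(.bytes_hash) | map(select(length > 1) | map(.path))'

Each remaining array lists paths whose bytes are identical.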

When hash might not be ideal:

  • You just need a quick hash of one file — use shasum or b3sum directly
  • You need content-aware hashing (e.g., ignoring whitespace) — use fingerprint content hashes
  • You need hash trees or verified streaming — deferred in v0

Installation

Homebrew (Recommended)

brew install cmdrvl/tap/hash

Shell Script

curl -fsSL https://raw.githubusercontent.com/cmdrvl/hash/main/scripts/install.sh | bash

From Source

cargo build --release
./target/release/hash --help

CLI Reference

hash [<INPUT>] [OPTIONS]
hash witness <query|last|count> [OPTIONS]

Arguments

  • [INPUT]: JSONL manifest file. Defaults to stdin.

Options

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --algorithm <ALG> | string | sha256 | Hash algorithm: sha256 or blake3 (case-insensitive) |
| --jobs <N> | integer | CPU count | Parallel workers; --jobs 1 for sequential |
| --no-witness | flag | false | Suppress witness ledger recording |
| --describe | flag | false | Print compiled operator.json to stdout, exit 0 |
| --schema | flag | false | Print JSON Schema to stdout, exit 0 |
| --progress | flag | false | Emit structured progress JSONL to stderr |
| --version | flag | false | Print hash <semver> to stdout, exit 0 |

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | ALL_HASHED (every record hashed successfully) |
| 1 | PARTIAL (some records skipped) |
| 2 | REFUSAL or CLI error |

Streams

  • stdout: enriched JSONL records (one per input record)
  • stderr: progress diagnostics (with --progress) or warnings

Input / Output Contract

Input

JSONL records from vacuum (or any tool producing vacuum.v0 records). Required fields (a minimal hand-written example follows the list):

  • path — absolute file path (used to open and read the file)
  • version — upstream record version
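
hash doesn't require vacuum specifically — per the FAQ, any JSONL carrying these fields is accepted. A minimal hand-written record (the path here is illustrative):

printf '{"version":"vacuum.v0","path":"/etc/hosts"}\n' | hash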

Output Record

Each input record is enriched with hash fields:

{
  "version": "hash.v0",
  "path": "/data/dec/tape.csv",
  "relative_path": "tape.csv",
  "root": "/data/dec",
  "size": 847201,
  "mtime": "2025-12-15T08:30:00.000Z",
  "extension": ".csv",
  "mime_guess": "text/csv",
  "bytes_hash": "sha256:7d865e959b2466918c9863afca942d0fb7903eb3",
  "hash_algorithm": "sha256",
  "tool_versions": { "vacuum": "0.1.0", "hash": "0.1.0" }
}

| Added Field | Type | Description |
| --- | --- | --- |
| bytes_hash | string | <algorithm>:<lowercase-hex> (null if _skipped) |
| hash_algorithm | string | "sha256" or "blake3" (null if _skipped) |

All upstream fields are passed through. version is updated to "hash.v0". tool_versions is merged with hash added.

Hash Formats

  • SHA-256: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 (64 hex chars)
  • BLAKE3: blake3:af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262 (64 hex chars)
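
Both values above are the empty-input digests, so they can be cross-checked against standalone tools (assuming coreutils sha256sum and the b3sum CLI are installed):

$ printf '' | sha256sum
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  -
$ printf '' | b3sum
af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262  -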

Skipped Record Handling

  • Upstream _skipped: Passed through without hashing — hash does NOT attempt to read the file and does NOT modify _warnings; it DOES update version and tool_versions
  • New skip: If hash can't read a file, it marks the record _skipped: true and appends a warning:
{
  "_skipped": true,
  "_warnings": [
    { "tool": "hash", "code": "E_FILE_READ", "message": "Cannot read file", "detail": { "path": "/data/dec/locked.xlsx", "error": "Permission denied" } }
  ]
}

Refusal Codes

| Code | Trigger | Next Step |
| --- | --- | --- |
| E_BAD_INPUT | Not valid JSONL or missing required fields | Check upstream output (run vacuum first) |
| E_IO | Cannot read input/output stream | Check stdin/stdout and filesystem |

Troubleshooting

"E_BAD_INPUT" — invalid input

You're piping something that isn't valid JSONL. Most common cause: piping raw file paths instead of vacuum output.

# Wrong:
echo "/data/tape.csv" | hash

# Right:
vacuum /data | hash

Some files show _skipped: true

hash couldn't read the file (permission denied, file deleted between vacuum and hash). Check the _warnings array:

vacuum /data | hash | jq 'select(._skipped == true) | {path, _warnings}'

Different hashes with --jobs 1 vs --jobs 4

This should not happen — hash guarantees deterministic output order regardless of parallelism. If you see different hashes, the files may have changed between runs.
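
A quick self-check — hash both ways and compare the streams byte for byte (assumes the files are not modified between the two runs; the path is illustrative):

diff <(vacuum /data | hash --jobs 1) <(vacuum /data | hash --jobs 4) && echo "identical"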

BLAKE3 vs SHA-256 — which to use?

SHA-256 is the default and most widely verified. BLAKE3 is faster (especially on large files) but produces different hashes. Choose one and stick with it — lock records hash_algorithm so downstream tools know which was used.

hash seems slow on many small files

Parallel workers have overhead per file. For many small files, try --jobs 1 to reduce scheduling overhead. For large files, more --jobs helps.
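
To see which setting wins on your data, time both (the path is illustrative; results depend on filesystem and CPU):

time (vacuum /data/many-small-files | hash --jobs 1 > /dev/null)
time (vacuum /data/many-small-files | hash > /dev/null)   # default: one worker per CPU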


Limitations

| Limitation | Detail |
| --- | --- |
| Whole-file hashing only | No range hashing or content-aware hashing — use fingerprint content hashes for that |
| No hash trees | No Merkle tree or bao verified streaming — deferred in v0 |
| No XXH3 | Non-cryptographic fast hashing not available — deferred in v0 |
| In-memory record buffering | Output ordering requires buffering; not true streaming for very large manifests |
| No hash verification | hash computes hashes — it doesn't verify stored hashes against current files |
| No dedup output | hash reports hashes but doesn't flag duplicates — do that downstream |

FAQ

Why not just use shasum?

shasum produces unstructured text (one <hash>  <filename> line per file). hash produces structured JSONL that preserves all upstream fields, accumulates tool versions, tracks skipped files, and pipes directly into fingerprint and lock. It's also parallel by default.

Why is SHA-256 the default instead of BLAKE3?

SHA-256 is ubiquitous — auditors, compliance teams, and most tooling expect it. BLAKE3 is available via --algorithm blake3 when speed matters more than compatibility.

Does hash read entire files into memory?

No. hash uses a 64 KB streaming buffer — memory usage is constant regardless of file size. A 100 GB file uses the same memory as a 1 KB file.
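
A rough way to verify this yourself, assuming GNU time is available at /usr/bin/time (the -v flag is GNU-specific; macOS ships a BSD time(1)):

/usr/bin/time -v sh -c 'vacuum /data | hash > /dev/null' 2>&1 | grep 'Maximum resident'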

Why does output order match input order with parallel jobs?

Determinism. Each record gets a sequence number. Workers hash in parallel, but the output thread reorders results by sequence number before emitting. Same input always produces the same output.

Can I hash files without vacuum?

Yes — any JSONL with a path field works. But vacuum provides the standard record format that downstream tools expect.

What happens to upstream _skipped records?

hash passes them through without attempting to hash. It updates version to hash.v0 and merges tool_versions, but leaves _skipped, _warnings, and other fields untouched.

Is the empty file hash stable?

Yes. SHA-256 of zero bytes is always sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855. BLAKE3 of zero bytes is always blake3:af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262.


Agent / CI Integration

Self-describing contract

$ hash --describe | jq '.exit_codes'
{
  "0": { "meaning": "ALL_HASHED" },
  "1": { "meaning": "PARTIAL" },
  "2": { "meaning": "REFUSAL" }
}

$ hash --describe | jq '.pipeline'
{
  "upstream": ["vacuum"],
  "downstream": ["fingerprint", "lock", "pack"]
}

Agent workflow

# 1. Hash all artifacts
vacuum /data/dec | hash > hashed.jsonl

case $? in
  0) echo "all hashed" ;;
  1) echo "partial — some files couldn't be hashed"
     jq -s '[.[] | select(._skipped == true)] | length' hashed.jsonl ;;
  2) echo "refusal"
     cat hashed.jsonl | jq '.code'
     exit 1 ;;
esac

# 2. Continue pipeline
cat hashed.jsonl | lock --dataset-id "dec" > dec.lock.json

What makes this agent-friendly

  • Exit codes — 0/1/2 map to complete/partial/error branching
  • Structured JSONL only — stdout is always machine-readable
  • --describe — prints operator.json so an agent discovers the tool without reading docs
  • --schema — prints the record JSON schema for programmatic validation
  • Deterministic — same input always produces the same output, enabling reliable caching

Witness Subcommands

hash records every run to an ambient witness ledger. You can query this ledger:

# Query by date range or outcome
hash witness query --tool hash --since 2026-01-01 --outcome ALL_HASHED --json

# Get the most recent run
hash witness last --json

# Count runs matching a filter
hash witness count --since 2026-02-01

Subcommand Reference

hash witness query [--tool <name>] [--since <iso8601>] [--until <iso8601>] \
  [--outcome <ALL_HASHED|PARTIAL|REFUSAL>] [--input-hash <substring>] \
  [--limit <n>] [--json]

hash witness last [--json]

hash witness count [--tool <name>] [--since <iso8601>] [--until <iso8601>] \
  [--outcome <ALL_HASHED|PARTIAL|REFUSAL>] [--input-hash <substring>] [--json]

Exit Codes (witness subcommands)

| Code | Meaning |
| --- | --- |
| 0 | One or more matching records returned |
| 1 | No matches (or empty ledger for last) |
| 2 | CLI parse error or witness internal error |
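
Because these exit codes are stable, a witness query can gate CI directly (the date filter is illustrative):

# fail the build if any REFUSAL outcome was recorded this year
if hash witness query --outcome REFUSAL --since 2026-01-01 > /dev/null; then
  echo "refusals found in witness ledger" >&2
  exit 1
fi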

Ledger Location

  • Default: ~/.epistemic/witness.jsonl
  • Override: set the EPISTEMIC_WITNESS environment variable (example below)
  • Malformed ledger lines are skipped; valid lines continue to be processed.
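
For a project-local ledger, point the variable at a path inside the repo (the path is illustrative; note the variable must be set on the hash invocation itself, not the first command in the pipeline):

vacuum /data | EPISTEMIC_WITNESS=./.witness.jsonl hash
EPISTEMIC_WITNESS=./.witness.jsonl hash witness last --json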

Spec and Development

The full specification is docs/PLAN.md. This README covers intended v0 behavior; the spec adds implementation details, edge-case definitions, and testing requirements.

cargo fmt --check
cargo clippy --all-targets -- -D warnings
cargo test
