Artifact inventory — deterministic directory scanning that produces a sorted JSONL manifest of every file, its size, timestamp, and MIME type.
No AI. No inference. Pure deterministic enumeration and serialization.
brew install cmdrvl/tap/vacuum
The Problem: Before you can hash, fingerprint, or lock data artifacts, you need to know what's there. Teams rely on find, ls, or ad-hoc scripts that produce inconsistent, non-reproducible inventories with no structured output.
The Solution: One command that walks directories and emits a deterministic, sorted JSONL manifest. Every file gets a record with path, size, mtime, extension, and MIME guess. Same inputs always produce the same output.
| Feature | What It Does |
|---|---|
| Deterministic | Sorted by (relative_path, root) — same directory always produces the same manifest |
| Structured JSONL | Every record is a machine-readable JSON object — no parsing ls output |
| Skipped tracking | Files that can't be stat'd are captured with warnings, not silently dropped |
| Multi-root | Scan multiple directories in one invocation — records interleaved deterministically |
| Include/Exclude | Glob patterns filter what enters the manifest |
| Pipeline native | Feeds directly into hash, fingerprint, lock |
| Audit trail | Every run recorded in the ambient witness ledger |
$ vacuum /data/dec
{"version":"vacuum.v0","path":"/data/dec/model.xlsx","relative_path":"model.xlsx","root":"/data/dec","size":2481920,"mtime":"2025-12-31T12:00:00.000Z","extension":".xlsx","mime_guess":"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet","tool_versions":{"vacuum":"0.1.0"}}
{"version":"vacuum.v0","path":"/data/dec/tape.csv","relative_path":"tape.csv","root":"/data/dec","size":847201,"mtime":"2025-12-15T08:30:00.000Z","extension":".csv","mime_guess":"text/csv","tool_versions":{"vacuum":"0.1.0"}}
Two files inventoried: sorted, typed, timestamped, ready for hash.
# Filter to specific extensions:
$ vacuum /data/dec --include "*.csv" --include "*.xlsx"
# Exclude scratch files:
$ vacuum /data --exclude "*.tmp" --exclude ".DS_Store"
# Multiple roots, single manifest:
$ vacuum /data/q3 /data/q4 > q3q4-manifest.jsonl
# Full pipeline into lockfile:
$ vacuum /data/dec | hash | fingerprint --fp csv.v0 \
| lock --dataset-id "dec" > dec.lock.json
vacuum is the first tool in the stream pipeline: it discovers what exists.
vacuum → hash → fingerprint → lock → pack
(scan) (hash) (template) (pin) (seal)
Each tool reads JSONL from stdin and emits enriched JSONL to stdout. vacuum starts the chain by walking directories and producing the initial manifest.
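The stream contract described above can be sketched in Python as a filter over JSONL lines. This is only an illustration of the shape of a stage, not how hash or any other tool is implemented: the `annotate` function and the `note` field are hypothetical names invented for this example.

```python
import json
from typing import Iterable, Iterator


def annotate(lines: Iterable[str]) -> Iterator[str]:
    """Read JSONL records and emit enriched JSONL, one line per input record.

    Hypothetical stage: it adds a made-up "note" field to each record.
    Records marked _skipped are passed through unchanged.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        record = json.loads(line)
        if not record.get("_skipped"):
            record["note"] = "seen"  # hypothetical enrichment
        yield json.dumps(record, separators=(",", ":"))
```

Fed with `sys.stdin` and printed line by line, a function like this would sit between two tools in a shell pipe.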
vacuum does not replace downstream tools.
| If you need... | Use |
|---|---|
| Compute SHA256/BLAKE3 hashes | hash |
| Match files against template definitions | fingerprint |
| Pin artifacts into a self-hashed lockfile | lock |
| Check structural comparability of CSVs | shape |
| Explain numeric changes between CSVs | rvl |
| Bundle into immutable evidence packs | pack |
vacuum only answers: what files exist, how big are they, and when were they last modified?
Every run ends in exactly one of two outcomes: SCAN_COMPLETE (exit 0) or REFUSAL (exit 2). There is no exit code 1: either the scan completes or it refuses.
SCAN_COMPLETE (exit 0): all roots were enumerated. Individual file-level failures (e.g., permission denied on a single file) are recorded as _skipped records in the output stream and don't prevent a successful scan.
$ vacuum /data/dec
# exit 0: all files inventoried (some may be _skipped)
REFUSAL (exit 2): cannot begin scanning. The root directory doesn't exist, isn't readable, or a filesystem error prevents the scan from starting.
{
"code": "E_ROOT_NOT_FOUND",
"message": "Root path does not exist",
"detail": { "root": "/data/nonexistent/" },
"next_command": null
}
Refusals always include the error code and detail.
| Capability | vacuum | find / ls | Custom script | tree --json |
|---|---|---|---|---|
| Deterministic sorted output | Yes | No | Depends | No |
| Structured JSONL records | Yes | No | You write it | Partial |
| Skipped file tracking | Yes (with warnings) | Silent | You write it | No |
| Multi-root interleaved output | Yes | No | You write it | No |
| Include/exclude glob filters | Yes | find only | You write it | No |
| MIME type guessing | Yes | No | You write it | No |
| Pipeline integration (hash/lock) | Yes | No | No | No |
| Audit trail (witness ledger) | Yes | No | No | No |
When to use vacuum:
- Start of a data pipeline — discover what artifacts exist before hashing and locking
- Audit and compliance — produce a reproducible inventory of a directory
- CI automation — machine-readable manifests that feed into downstream tools
When vacuum might not be ideal:
- You need recursive file search with complex predicates: use find
- You need file content analysis: use downstream tools (hash, fingerprint, shape)
- You need real-time filesystem watching: vacuum is a point-in-time snapshot
With Homebrew:
brew install cmdrvl/tap/vacuum
With the install script:
curl -fsSL https://raw.githubusercontent.com/cmdrvl/vacuum/main/scripts/install.sh | bash
From source:
cargo build --release
./target/release/vacuum --help
Usage:
vacuum <ROOT>... [OPTIONS]
vacuum witness <query|last|count> [OPTIONS]
<ROOT>...: One or more directories to scan. At least one required.
| Flag | Type | Default | Description |
|---|---|---|---|
| `--include <GLOB>` | string | all files | Include pattern (repeatable) |
| `--exclude <GLOB>` | string | none | Exclude pattern (repeatable) |
| `--no-follow` | flag | false | Do not follow symlinks |
| `--no-witness` | flag | false | Suppress witness ledger recording |
| `--describe` | flag | false | Print compiled operator.json to stdout, exit 0 |
| `--schema` | flag | false | Print JSONL record JSON schema, exit 0 |
| `--progress` | flag | false | Emit structured progress JSONL to stderr |
| `--version` | flag | false | Print `vacuum <semver>` to stdout, exit 0 |
| Code | Meaning |
|---|---|
| 0 | SCAN_COMPLETE (all roots enumerated) |
| 2 | REFUSAL or CLI error |
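For callers that branch on the exit code, the table above maps onto a small helper. The sketch below is a hypothetical wrapper, not part of vacuum: `classify_exit` and `scan` are names made up here, and `scan` assumes a `vacuum` binary is on PATH. It returns stdout as raw text rather than guessing at how a refusal is laid out.

```python
import subprocess
from typing import Sequence, Tuple


def classify_exit(code: int) -> str:
    """Map a vacuum exit code to the outcome named in the exit-code table."""
    if code == 0:
        return "SCAN_COMPLETE"
    if code == 2:
        return "REFUSAL"  # the table groups refusals and CLI errors under 2
    # The README documents no other exit codes for a scan.
    raise ValueError(f"undocumented vacuum exit code: {code}")


def scan(roots: Sequence[str]) -> Tuple[str, str]:
    """Run vacuum over the given roots and return (outcome, raw stdout).

    Assumes a `vacuum` executable is installed and on PATH.
    """
    proc = subprocess.run(["vacuum", *roots], capture_output=True, text=True)
    return classify_exit(proc.returncode), proc.stdout
```

Raising on an undocumented code is a deliberate choice in this sketch: it surfaces crashes or signals instead of folding them into one of the two documented outcomes.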
- stdout: JSONL manifest records (one per file)
- stderr: progress diagnostics (with --progress) or warnings
Every discovered file produces one vacuum.v0 record:
{
"version": "vacuum.v0",
"path": "/data/dec/tape.csv",
"relative_path": "tape.csv",
"root": "/data/dec",
"size": 847201,
"mtime": "2025-12-15T08:30:00.000Z",
"extension": ".csv",
"mime_guess": "text/csv",
"tool_versions": { "vacuum": "0.1.0" }
}
| Field | Type | Nullable | Description |
|---|---|---|---|
| version | string | no | Always "vacuum.v0" |
| path | string | no | Absolute path (OS-native separators) |
| relative_path | string | no | Path relative to root (forward slashes) |
| root | string | no | Absolute path of scan root |
| size | u64 | no | File size in bytes |
| mtime | string | no | ISO 8601 UTC with millisecond precision |
| extension | string | yes | File extension including dot (null if none) |
| mime_guess | string | yes | MIME type from extension lookup (null if unknown) |
| tool_versions | object | no | `{ "vacuum": "<semver>" }` |
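To make the field definitions concrete, here is a minimal Python sketch that derives a record-shaped dict from `os.stat`. It is an approximation under stated assumptions, not vacuum's implementation: `make_record` is a hypothetical name, the version string is a placeholder, MIME lookup is left out, and symlinks are not resolved to canonical paths.

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def make_record(root: str, file_path: str) -> dict:
    """Build a dict shaped like a vacuum.v0 record (illustrative only)."""
    st = os.stat(file_path)
    mtime = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
    ext = Path(file_path).suffix  # "" when the name has no extension
    return {
        "version": "vacuum.v0",
        "path": os.path.abspath(file_path),
        # Forward slashes regardless of the OS separator.
        "relative_path": Path(os.path.relpath(file_path, root)).as_posix(),
        "root": os.path.abspath(root),
        "size": st.st_size,
        # ISO 8601 UTC, millisecond precision, trailing "Z".
        "mtime": mtime.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "extension": ext or None,
        "mime_guess": None,  # MIME lookup omitted in this sketch
        "tool_versions": {"vacuum": "0.1.0"},  # placeholder version string
    }
```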
Files that can't be stat'd (permission denied, broken symlinks) produce a skipped record:
{
"version": "vacuum.v0",
"path": "/data/dec/protected.xlsx",
"relative_path": "protected.xlsx",
"root": "/data/dec",
"size": null,
"mtime": null,
"extension": ".xlsx",
"mime_guess": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
"_skipped": true,
"_warnings": [
{ "tool": "vacuum", "code": "E_FILE_PERMISSION", "message": "Cannot read file metadata", "detail": { "error": "Permission denied" } }
],
"tool_versions": { "vacuum": "0.1.0" }
}
Skipped records flow downstream: hash passes them through, and lock collects them in the skipped array.
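A consumer that wants to treat skipped records separately can partition the manifest on the `_skipped` flag. The Python sketch below relies only on the `_skipped` and `_warnings` fields shown in the example record; `split_skipped` and `warning_codes` are hypothetical helper names.

```python
import json
from typing import List, Tuple


def split_skipped(jsonl_text: str) -> Tuple[List[dict], List[dict]]:
    """Partition manifest lines into (inventoried, skipped) record lists."""
    inventoried, skipped = [], []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("_skipped") is True:
            skipped.append(record)
        else:
            inventoried.append(record)
    return inventoried, skipped


def warning_codes(record: dict) -> List[str]:
    """Collect the warning codes attached to one record, if any."""
    return [w["code"] for w in record.get("_warnings", [])]
```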
| Code | Trigger | Next Step |
|---|---|---|
| E_ROOT_NOT_FOUND | Root path doesn't exist | Check path spelling and that directory exists |
| E_ROOT_PERMISSION | Can't read root directory | Check directory permissions |
| E_IO | Filesystem error preventing scan start | Check disk/mount health |
Multiple roots: fail-fast on the first failing root.
E_ROOT_NOT_FOUND: check that the path is correct and the directory exists:
ls -la /data/dec/ # verify directory exists
vacuum /data/dec
E_ROOT_PERMISSION: you don't have permission to read the root directory:
ls -la /data/ # check permissions on parent
# Fix: adjust permissions or run with appropriate access
Individual files may be skipped due to permission issues or broken symlinks. These still produce records with _skipped: true and don't prevent the scan from completing (exit 0). Check the _warnings array for details:
vacuum /data/dec | jq 'select(._skipped == true) | .path'
If the output is empty, the directory exists but contains no files matching your include/exclude patterns:
# Check what's actually in the directory:
ls -la /data/dec/
# Check your patterns:
vacuum /data/dec --include "*.csv" # too narrow?
Patterns are matched against relative_path (forward-slash normalized), not the absolute path:
# Wrong — matching against absolute path:
vacuum /data --include "/data/*.csv"
# Right — matching against relative path:
vacuum /data --include "*.csv"
vacuum /data --include "subdir/*.csv"
| Limitation | Detail |
|---|---|
| In-memory collection | All records collected before emission (deterministic ordering requires it) |
| Extension-based MIME | MIME guessing uses file extension, not content sniffing — unknown extensions → null |
| No content hashing | vacuum doesn't read file contents — use hash for that |
| No recursive exclude | --exclude patterns match against relative paths, not directory tree structure |
| Point-in-time snapshot | No file watching — re-run vacuum to detect changes |
| No exit code 1 | Per-file failures are _skipped records, not partial outcomes (unlike hash/lock) |
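Because patterns are matched against relative_path, a pattern written against the absolute path never matches. The Python sketch below shows that effect with the standard library's `fnmatch`. It is not vacuum's matcher: the glob dialect vacuum uses may differ from `fnmatch`, `selected` is a hypothetical name, and letting exclude override include is an assumption.

```python
from fnmatch import fnmatchcase
from typing import Sequence


def selected(
    relative_path: str,
    include: Sequence[str] = (),
    exclude: Sequence[str] = (),
) -> bool:
    """Return True if relative_path passes the include/exclude patterns.

    Assumptions: an empty include list means "all files" (the documented
    default), and an exclude match overrides an include match.
    """
    if include and not any(fnmatchcase(relative_path, p) for p in include):
        return False
    return not any(fnmatchcase(relative_path, p) for p in exclude)


# The pattern is compared with "tape.csv", never with "/data/tape.csv".
print(selected("tape.csv", include=["/data/*.csv"]))  # False
print(selected("tape.csv", include=["*.csv"]))        # True
```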
find produces unstructured text that requires parsing. vacuum produces deterministic JSONL with rich metadata (size, mtime, MIME type, extension) that pipes directly into the rest of the pipeline. Same directory always produces identical output.
vacuum's job is enumeration, not transformation. Either the scan starts (exit 0) or it can't (exit 2). Per-file issues like permission denied are recorded as _skipped records in the output stream — they don't prevent the scan from completing.
Records are collected in memory before emission because of deterministic ordering: they are sorted by (relative_path, root) in byte order, which requires seeing all files before emitting any. This trades latency for reproducibility.
Records from all roots are interleaved by relative_path, with root as tiebreaker. This means files with the same relative path from different roots appear adjacent in the output.
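The ordering rule can be sketched in a few lines of Python. This assumes that "byte order" means comparing the UTF-8 bytes of each string, which the README does not spell out; `sort_key` and `order_manifest` are hypothetical names.

```python
from typing import Iterable, List, Tuple


def sort_key(record: dict) -> Tuple[bytes, bytes]:
    """Sort key: relative_path first, root as tiebreaker, compared as bytes."""
    return (
        record["relative_path"].encode("utf-8"),  # assumed encoding
        record["root"].encode("utf-8"),
    )


def order_manifest(records: Iterable[dict]) -> List[dict]:
    """Return records in the interleaved multi-root order described above."""
    return sorted(records, key=sort_key)


records = [
    {"relative_path": "tape.csv", "root": "/data/q4"},
    {"relative_path": "model.xlsx", "root": "/data/q4"},
    {"relative_path": "tape.csv", "root": "/data/q3"},
    {"relative_path": "model.xlsx", "root": "/data/q3"},
]
for r in order_manifest(records):
    print(r["relative_path"], r["root"])
```

With this key, both model.xlsx records come before both tape.csv records, and within each pair /data/q3 precedes /data/q4.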
To scan several directories in one run, pass multiple roots:
vacuum /data/q3 /data/q4 /data/q1
All files from all roots appear in a single sorted manifest.
By default, vacuum follows symlinks and resolves targets to canonical paths. Use --no-follow to skip symlinks entirely.
Extension-based lookup from a built-in table: .csv, .tsv, .txt, .json, .jsonl, .xml, .pdf, .xlsx, .xls, .parquet, .zip, .gz, .yaml/.yml, and others. Unknown extensions produce null.
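An extension-based lookup of this kind can be sketched in Python as a plain dictionary. The table below is a small illustrative subset, not vacuum's built-in table, and lowercasing the extension before lookup is an assumption. `guess` is a hypothetical name.

```python
import os
from typing import Optional, Tuple

# Illustrative subset only; vacuum's real table covers more extensions.
MIME_BY_EXTENSION = {
    ".csv": "text/csv",
    ".json": "application/json",
    ".pdf": "application/pdf",
    ".txt": "text/plain",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def guess(path: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (extension, mime_guess) for a path, using None for unknowns."""
    ext = os.path.splitext(path)[1]  # "" when there is no extension
    extension = ext or None
    # Assumption: lookup is case-insensitive on the extension.
    mime = MIME_BY_EXTENSION.get(ext.lower())
    return extension, mime
```

Since nothing is read from the file, a CSV saved under an unrecognized extension would get a null mime_guess in a scheme like this.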
$ vacuum --describe | jq '.exit_codes'
{
"0": { "meaning": "SCAN_COMPLETE" },
"2": { "meaning": "REFUSAL" }
}
$ vacuum --describe | jq '.pipeline'
{
"upstream": [],
"downstream": ["hash", "fingerprint", "lock", "pack"]
}
# 1. Scan directory
vacuum /data/dec > manifest.jsonl
case $? in
0) echo "scan complete"
wc -l manifest.jsonl ;;
2) echo "refusal"
cat manifest.jsonl | jq '.code'
exit 1 ;;
esac
# 2. Count skipped files
skipped=$(jq -s '[.[] | select(._skipped == true)] | length' manifest.jsonl)
echo "skipped: $skipped"
# 3. Pipe into rest of pipeline
cat manifest.jsonl | hash | lock --dataset-id "dec" > dec.lock.json
- Exit codes: 0/2 map to success/error branching (no ambiguous exit 1)
- Structured JSONL only: stdout is always machine-readable
- --describe: prints operator.json so an agent discovers the tool without reading docs
- --schema: prints the record JSON schema for programmatic validation
- Skipped records inline: agents can filter _skipped records without separate error streams
Witness Subcommands
vacuum records every scan to an ambient witness ledger. You can query this ledger:
# Query by date range or outcome
vacuum witness query --tool vacuum --since 2026-01-01 --outcome SCAN_COMPLETE --json
# Get the most recent scan
vacuum witness last --json
# Count scans matching a filter
vacuum witness count --since 2026-02-01
vacuum witness query [--tool <name>] [--since <iso8601>] [--until <iso8601>] \
[--outcome <SCAN_COMPLETE|REFUSAL>] [--input-hash <substring>] \
[--limit <n>] [--json]
vacuum witness last [--json]
vacuum witness count [--tool <name>] [--since <iso8601>] [--until <iso8601>] \
[--outcome <SCAN_COMPLETE|REFUSAL>] [--input-hash <substring>] [--json]
| Code | Meaning |
|---|---|
| 0 | One or more matching records returned |
| 1 | No matches (or empty ledger for last) |
| 2 | CLI parse error or witness internal error |
- Default: ~/.epistemic/witness.jsonl
- Override: set EPISTEMIC_WITNESS environment variable
- Malformed ledger lines are skipped; valid lines continue to be processed.
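The ledger location rules and the skip-malformed-lines behaviour can be approximated in Python as follows. This is a reader sketch only: `ledger_path` and `read_ledger` are hypothetical names, and the fields inside each ledger entry are not described in this README, so entries are returned as opaque dicts.

```python
import json
import os
from typing import List


def ledger_path() -> str:
    """Resolve the ledger file: EPISTEMIC_WITNESS if set, else the default."""
    override = os.environ.get("EPISTEMIC_WITNESS")
    return override or os.path.expanduser("~/.epistemic/witness.jsonl")


def read_ledger(path: str) -> List[dict]:
    """Parse a JSONL ledger, skipping blank and malformed lines."""
    entries = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # malformed line: skip it and keep going
    return entries
```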
The full specification is docs/PLAN.md. This README covers intended v0 behavior; the spec adds implementation details, edge-case definitions, and testing requirements.
cargo fmt --check
cargo clippy --all-targets -- -D warnings
cargo test