Operationalized corpus and methodology distilled from Jerome H. Lemelson's invention notebooks.
curl -fsSL https://raw.githubusercontent.com/Dicklesworthstone/lemelsonbot/main/LEMELSON_NOTEBOOKS_EXTRACTED_v1.md -o LEMELSON_NOTEBOOKS_EXTRACTED_v1.mdThe Problem
- The notebooks are scanned PDFs with Smithsonian headers and repeated metadata.
- OCR output is inconsistent and hard to search at scale.
The Solution
- A single cleaned corpus file plus a structured methodology distillation that is machine-parseable.
Why Use Lemelsonbot?
| Feature | What you get | Why it matters |
|---|---|---|
| Cleaned corpus | LEMELSON_NOTEBOOKS_EXTRACTED_v1.md with boilerplate removed |
Search without noise |
| Evidence traceability | Quote bank and provenance graph | Every rule points back to sources |
| Methodology distillation | Triangulated kernel and operator library | Reusable invention heuristics |
| Validation scripts | scripts/validate-*.py |
Prevents drift and regressions |
| Machine markers | HTML comment markers for kernels/operators | Easy downstream parsing |
rg -n "feedback" LEMELSON_NOTEBOOKS_EXTRACTED_v1.md | head
python3 scripts/validate-corpus.py
python3 scripts/validate-kernel.py
python3 scripts/extract-kernel.py --in corpus/specs/triangulated_kernel.md --out artifacts/triangulated_kernel.md
rg -n "OPERATOR_CARD_START" corpus/specs/operator_library.md | head
python3 scripts/validate-operators.py
python3 scripts/validate-kickoffs.py- Evidence first: Every operator is anchored to corpus excerpts and quote IDs.
- Stable artifacts: Kernel, operator library, and specs are versioned and linted.
- Machine-parseable by default: Markers make extraction deterministic.
- Progressive disclosure: Glossary and kickoffs let roles work at different depth.
- Validation in CI: Scripts encode the contract so changes fail fast.
| Approach | Cleaned text | Methodology distillation | Validation | Machine markers |
|---|---|---|---|---|
| Lemelsonbot | Yes | Yes | Yes | Yes |
| Raw PDFs | No | No | No | No |
| OCR dump only | Partial | No | No | No |
| General note archive | Partial | Partial | No | No |
No build step is required. Choose the path that matches how you want to use the data.
curl -fsSL https://raw.githubusercontent.com/Dicklesworthstone/lemelsonbot/main/LEMELSON_NOTEBOOKS_EXTRACTED_v1.md -o LEMELSON_NOTEBOOKS_EXTRACTED_v1.mdgit clone https://github.com/Dicklesworthstone/lemelsonbot.git
cd lemelsonbotgh repo clone Dicklesworthstone/lemelsonbot
cd lemelsonbotRequirements
- Python 3.10+ for validation scripts
rg(ripgrep) for fast searching (optional)
- Get the corpus or clone the repo.
- Search for a topic:
rg -n "sensor" LEMELSON_NOTEBOOKS_EXTRACTED_v1.md | head- Validate the corpus and kernel:
python3 scripts/validate-corpus.py
python3 scripts/validate-kernel.py- Inspect the methodology artifacts:
less corpus/specs/triangulated_kernel.md
less corpus/specs/operator_library.md- Export the kernel for downstream use:
python3 scripts/extract-kernel.py --in corpus/specs/triangulated_kernel.md --out artifacts/triangulated_kernel.mdChecks required files and quote bank rules.
python3 scripts/validate-corpus.pyEnsures markers and minimum counts for axioms/operators.
python3 scripts/validate-kernel.pyChecks operator card formatting and tag rules.
python3 scripts/validate-operators.pyConfirms kickoff files exist and are non-empty.
python3 scripts/validate-kickoffs.pyOutputs the kernel block to a standalone file.
python3 scripts/extract-kernel.py --in corpus/specs/triangulated_kernel.md --out artifacts/triangulated_kernel.mdNo runtime config is required. The repository is convention-based. If you want
to change thresholds, edit the constants in scripts/validate-*.py.
Documented defaults (not parsed by code):
# lemelsonbot.defaults.ini
[corpus]
min_quote_count = 300
min_quote_len = 80
max_quote_len = 320
[kernel]
min_axioms = 5
min_operators = 12pdf_originals/ --> extraction --> LEMELSON_NOTEBOOKS_EXTRACTED_v1.md
|
v
corpus/primary
|
v
distillations/ --> triangulated_kernel --> operator_library --> artifacts/
\ |
\-> quote_bank ---------/
|
v
provenance_graph
validate-corpus.pyfails with quote count errors: ensurecorpus/quote_bank/quote_bank.mdhas at least 300 entries.validate-kernel.pyfails on markers: checkcorpus/specs/triangulated_kernel.mdfor the start/end comments.validate-operators.pyfails on tags: confirm operator cards use allowed tags and required sections.validate-kickoffs.pyfails: verify all kickoff files incorpus/specs/are present and non-empty.rgis missing: install withsudo apt install ripgrepor usegrep.
- The repo does not include the original scanned images.
- The methodology distillation is interpretive, not a definitive historical record.
- There is no automated re-OCR pipeline in this repository.
- Validation scripts enforce structure, not historical accuracy.
Q: Is the full corpus in one file?
A: Yes. LEMELSON_NOTEBOOKS_EXTRACTED_v1.md is the single cleaned corpus file.
Q: Can I cite the extracted text?
A: Cite the Smithsonian source and the original notebook identifiers listed in the corpus.
Q: How do I parse the operator library programmatically?
A: Use the HTML comment markers that wrap each operator card and the kernel block.
Q: Why does the corpus remove Smithsonian boilerplate?
A: It keeps identifiers but removes repeated metadata so searches are meaningful.
Q: Do I need Python to read the corpus?
A: No. Python is only required for the validation and extraction scripts.
About Contributions: Please don't take this the wrong way, but I do not accept outside contributions for any of my projects. I simply don't have the mental bandwidth to review anything, and it's my name on the thing, so I'm responsible for any problems it causes; thus, the risk-reward is highly asymmetric from my perspective. I'd also have to worry about other "stakeholders," which seems unwise for tools I mostly make for myself for free. Feel free to submit issues, and even PRs if you want to illustrate a proposed fix, but know I won't merge them directly. Instead, I'll have Claude or Codex review submissions via
ghand independently decide whether and how to address them. Bug reports in particular are welcome. Sorry if this offends, but I want to avoid wasted time and hurt feelings. I understand this isn't in sync with the prevailing open-source ethos that seeks community contributions, but it's the only way I can move at this velocity and keep my sanity.
No license specified. All rights reserved.
