Lemelsonbot

Operationalized corpus and methodology distilled from Jerome H. Lemelson's invention notebooks.

Quick Install

curl -fsSL https://raw.githubusercontent.com/Dicklesworthstone/lemelsonbot/main/LEMELSON_NOTEBOOKS_EXTRACTED_v1.md -o LEMELSON_NOTEBOOKS_EXTRACTED_v1.md

TL;DR

The Problem

The notebooks are scanned PDFs with Smithsonian headers and repeated metadata.
OCR output is inconsistent and hard to search at scale.

The Solution

A single cleaned corpus file plus a structured methodology distillation that is machine-parseable.

Why Use Lemelsonbot?

| Feature | What you get | Why it matters |
| --- | --- | --- |
| Cleaned corpus | `LEMELSON_NOTEBOOKS_EXTRACTED_v1.md` with boilerplate removed | Search without noise |
| Evidence traceability | Quote bank and provenance graph | Every rule points back to sources |
| Methodology distillation | Triangulated kernel and operator library | Reusable invention heuristics |
| Validation scripts | `scripts/validate-*.py` | Prevents drift and regressions |
| Machine markers | HTML comment markers for kernels/operators | Easy downstream parsing |

Quick Example

rg -n "feedback" LEMELSON_NOTEBOOKS_EXTRACTED_v1.md | head
python3 scripts/validate-corpus.py
python3 scripts/validate-kernel.py
python3 scripts/extract-kernel.py --in corpus/specs/triangulated_kernel.md --out artifacts/triangulated_kernel.md
rg -n "OPERATOR_CARD_START" corpus/specs/operator_library.md | head
python3 scripts/validate-operators.py
python3 scripts/validate-kickoffs.py

Design Philosophy

Evidence first: Every operator is anchored to corpus excerpts and quote IDs.
Stable artifacts: Kernel, operator library, and specs are versioned and linted.
Machine-parseable by default: Markers make extraction deterministic.
Progressive disclosure: Glossary and kickoffs let roles work at different depth.
Validation in CI: Scripts encode the contract so changes fail fast.
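
The "Validation in CI" principle could be wired up with a small fail-fast runner such as the sketch below. It is not part of the repository: `run_validators` and `validator_commands` are hypothetical names, and the sketch assumes each script signals failure with a nonzero exit code, which this README does not state.

```python
import subprocess
import sys

# Script paths as listed in this README.
VALIDATORS = [
    "scripts/validate-corpus.py",
    "scripts/validate-kernel.py",
    "scripts/validate-operators.py",
    "scripts/validate-kickoffs.py",
]


def validator_commands():
    """Build one command line per validator, using the current interpreter."""
    return [[sys.executable, path] for path in VALIDATORS]


def run_validators(commands):
    """Run each command in order; return the first failing command, or None.

    Assumption: a validator reports failure via a nonzero exit code.
    """
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return cmd  # fail fast: remaining validators are skipped
    return None
```

From the repository root, `run_validators(validator_commands())` returns `None` when every script passes, so a CI step can exit nonzero whenever the result is not `None`.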

Comparison

| Approach | Cleaned text | Methodology distillation | Validation | Machine markers |
| --- | --- | --- | --- | --- |
| Lemelsonbot | Yes | Yes | Yes | Yes |
| Raw PDFs | No | No | No | No |
| OCR dump only | Partial | No | No | No |
| General note archive | Partial | Partial | No | No |

Installation

No build step is required. Choose the path that matches how you want to use the data.

Option 1: Download the corpus only (curl)

curl -fsSL https://raw.githubusercontent.com/Dicklesworthstone/lemelsonbot/main/LEMELSON_NOTEBOOKS_EXTRACTED_v1.md -o LEMELSON_NOTEBOOKS_EXTRACTED_v1.md

Option 2: Clone the full repo

git clone https://github.com/Dicklesworthstone/lemelsonbot.git
cd lemelsonbot

Option 3: GitHub CLI

gh repo clone Dicklesworthstone/lemelsonbot
cd lemelsonbot

Requirements

Python 3.10+ for validation scripts
rg (ripgrep) for fast searching (optional)

Quick Start

Get the corpus or clone the repo.
Search for a topic:

rg -n "sensor" LEMELSON_NOTEBOOKS_EXTRACTED_v1.md | head

Validate the corpus and kernel:

python3 scripts/validate-corpus.py
python3 scripts/validate-kernel.py

Inspect the methodology artifacts:

less corpus/specs/triangulated_kernel.md
less corpus/specs/operator_library.md

Export the kernel for downstream use:

python3 scripts/extract-kernel.py --in corpus/specs/triangulated_kernel.md --out artifacts/triangulated_kernel.md

Command Reference

Validate corpus structure

Checks required files and quote bank rules.

python3 scripts/validate-corpus.py

Validate triangulated kernel

Ensures markers and minimum counts for axioms/operators.

python3 scripts/validate-kernel.py

Validate operator library

Checks operator card formatting and tag rules.

python3 scripts/validate-operators.py

Validate session kickoffs

Confirms kickoff files exist and are non-empty.

python3 scripts/validate-kickoffs.py
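
For readers who want to see what an "exists and is non-empty" check amounts to, here is a minimal sketch. It is not the code in `scripts/validate-kickoffs.py`: the function name `find_bad_kickoffs` is hypothetical, the caller supplies the file paths (the README does not list the kickoff filenames), and treating whitespace-only files as empty is an assumption.

```python
from pathlib import Path


def find_bad_kickoffs(paths):
    """Return (path, reason) pairs for kickoff files that are missing or empty."""
    problems = []
    for raw in paths:
        path = Path(raw)
        if not path.is_file():
            problems.append((str(path), "missing"))
        elif not path.read_text(encoding="utf-8").strip():
            # Assumption: a file containing only whitespace counts as empty.
            problems.append((str(path), "empty"))
    return problems
```

An empty return value means every listed file passed.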

Extract the kernel

Outputs the kernel block to a standalone file.

python3 scripts/extract-kernel.py --in corpus/specs/triangulated_kernel.md --out artifacts/triangulated_kernel.md
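
Conceptually, extracting a marked block comes down to finding a start comment and the next end comment. The sketch below illustrates that idea only and is not the logic of `scripts/extract-kernel.py`. The function name `extract_block` is hypothetical, and the caller passes the marker strings because this README does not spell out the kernel's exact marker text.

```python
def extract_block(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker.

    Raises ValueError if either marker is missing.
    """
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker}")
    body_start = start + len(start_marker)
    end = text.find(end_marker, body_start)
    if end == -1:
        raise ValueError(f"end marker not found: {end_marker}")
    return text[body_start:end].strip()
```

Raising on a missing marker, rather than returning an empty string, keeps a broken file from silently producing an empty artifact.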

Configuration

No runtime config is required. The repository is convention-based. If you want to change thresholds, edit the constants in scripts/validate-*.py.

Documented defaults (not parsed by code):

# lemelsonbot.defaults.ini
[corpus]
min_quote_count = 300
min_quote_len = 80
max_quote_len = 320

[kernel]
min_axioms = 5
min_operators = 12
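
Because these defaults are documentation rather than parsed configuration, the sketch below only shows how the same numbers could be applied to a list of quotes. The names `load_defaults` and `check_quotes` are hypothetical, and measuring quote length in characters is an assumption; the README does not say whether the limits count characters or words.

```python
import configparser

# Values copied from the documented defaults above.
DEFAULTS_INI = """
[corpus]
min_quote_count = 300
min_quote_len = 80
max_quote_len = 320

[kernel]
min_axioms = 5
min_operators = 12
"""


def load_defaults(ini_text=DEFAULTS_INI):
    """Parse the INI text into {section: {key: int}}."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {
        section: {key: int(value) for key, value in parser[section].items()}
        for section in parser.sections()
    }


def check_quotes(quotes, limits):
    """Return a list of problems; an empty list means the quotes pass.

    Assumption: lengths are measured in characters.
    """
    problems = []
    if len(quotes) < limits["min_quote_count"]:
        problems.append(
            f"only {len(quotes)} quotes, need {limits['min_quote_count']}"
        )
    for index, quote in enumerate(quotes):
        if not limits["min_quote_len"] <= len(quote) <= limits["max_quote_len"]:
            problems.append(f"quote {index}: length {len(quote)} out of range")
    return problems
```

A call such as `check_quotes(quotes, load_defaults()["corpus"])` keeps the thresholds in one place instead of scattering literals through the checks.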

Architecture

pdf_originals/ --> extraction --> LEMELSON_NOTEBOOKS_EXTRACTED_v1.md
                                   |
                                   v
                               corpus/primary
                                   |
                                   v
distillations/ --> triangulated_kernel --> operator_library --> artifacts/
          \                         |
           \-> quote_bank ---------/
                    |
                    v
              provenance_graph

Troubleshooting

validate-corpus.py fails with quote count errors: ensure corpus/quote_bank/quote_bank.md has at least 300 entries.
validate-kernel.py fails on markers: check corpus/specs/triangulated_kernel.md for the start/end comments.
validate-operators.py fails on tags: confirm operator cards use allowed tags and required sections.
validate-kickoffs.py fails: verify all kickoff files in corpus/specs/ are present and non-empty.
rg is missing: install ripgrep (for example, sudo apt install ripgrep on Debian or Ubuntu) or fall back to grep -n.
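
If neither rg nor grep is available, a few lines of Python can stand in for `rg -n PATTERN FILE | head`. This is a sketch with a hypothetical function name, `search_lines`. It uses Python's `re` syntax, which differs from ripgrep's regex engine for advanced patterns but behaves the same for plain words such as "sensor".

```python
import re


def search_lines(lines, pattern, limit=10):
    """Return up to `limit` (line_number, line) pairs whose line matches `pattern`.

    Line numbers start at 1, matching `rg -n`; the default limit of 10
    mirrors the default output of `head`.
    """
    regex = re.compile(pattern)
    hits = []
    if limit <= 0:
        return hits
    for number, line in enumerate(lines, start=1):
        if regex.search(line):
            hits.append((number, line.rstrip("\n")))
            if len(hits) >= limit:
                break
    return hits


# Example usage (run from the directory that holds the corpus file):
# with open("LEMELSON_NOTEBOOKS_EXTRACTED_v1.md", encoding="utf-8") as fh:
#     for number, line in search_lines(fh, "sensor"):
#         print(f"{number}:{line}")
```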

Limitations

The repo does not include the original scanned images.
The methodology distillation is interpretive, not a definitive historical record.
There is no automated re-OCR pipeline in this repository.
Validation scripts enforce structure, not historical accuracy.

FAQ

Q: Is the full corpus in one file?
A: Yes. LEMELSON_NOTEBOOKS_EXTRACTED_v1.md is the single cleaned corpus file.

Q: Can I cite the extracted text?
A: Cite the Smithsonian source and the original notebook identifiers listed in the corpus.

Q: How do I parse the operator library programmatically?
A: Use the HTML comment markers that wrap each operator card and the kernel block.

Q: Why does the corpus remove Smithsonian boilerplate?
A: It keeps identifiers but removes repeated metadata so searches are meaningful.

Q: Do I need Python to read the corpus?
A: No. Python is only required for the validation and extraction scripts.
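
As a companion to the FAQ answer on parsing the operator library, the sketch below splits a Markdown file into operator cards. The `OPERATOR_CARD_START` token appears in the Quick Example. The matching `OPERATOR_CARD_END` token, the exact comment layout, and the function name `split_operator_cards` are assumptions, so check them against `corpus/specs/operator_library.md` before relying on this.

```python
import re

# Assumption: each card is wrapped as
#   <!-- OPERATOR_CARD_START ... --> card body <!-- OPERATOR_CARD_END ... -->
CARD_PATTERN = re.compile(
    r"<!--\s*OPERATOR_CARD_START\b.*?-->(.*?)<!--\s*OPERATOR_CARD_END\b.*?-->",
    re.DOTALL,
)


def split_operator_cards(markdown_text):
    """Return the body of every operator card, in document order."""
    return [body.strip() for body in CARD_PATTERN.findall(markdown_text)]
```

The pattern tolerates extra text inside the comment (for example an ID after the token), which costs nothing if the real markers turn out to be bare.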

About Contributions

Please don't take this the wrong way, but I do not accept outside contributions for any of my projects. I simply don't have the mental bandwidth to review anything, and it's my name on the thing, so I'm responsible for any problems it causes; thus, the risk-reward is highly asymmetric from my perspective. I'd also have to worry about other "stakeholders," which seems unwise for tools I mostly make for myself for free. Feel free to submit issues, and even PRs if you want to illustrate a proposed fix, but know I won't merge them directly. Instead, I'll have Claude or Codex review submissions via gh and independently decide whether and how to address them. Bug reports in particular are welcome. Sorry if this offends, but I want to avoid wasted time and hurt feelings. I understand this isn't in sync with the prevailing open-source ethos that seeks community contributions, but it's the only way I can move at this velocity and keep my sanity.
