LLM reliability research from Hassana Labs.
Three tools, one goal: catch models that know but don't use what they know.
Ask Claude to count the r's in "strawberry." It writes "s-t-r-a-w-b-e-r-r-y," identifies each r, gets to 3. Then outputs "2."
The model didn't lack information. The answer was right there—in text it generated moments earlier. The computation worked. The routing failed.
This toolkit detects those failures mathematically.
```bash
pip install pythea
```

```bash
python -m strawberry.factual_recall \
    --question "Which US senators from Minnesota graduated from Princeton" \
    --out report.json
```

What it catches:
- RAG that retrieves but doesn't read
- Chain-of-thought that cites steps it ignored
- Self-verification that validates without checking
- Citation confabulation (decorative sources)
How: Scrub the cited evidence, measure confidence change. No change? The model was confabulating.
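A minimal sketch of that check, assuming a hypothetical `answer_confidence(question, evidence)` helper that returns the model's confidence in its answer given a set of evidence passages (the toolkit's actual API may differ):

```python
# Hypothetical evidence-ablation check (illustration only, not the toolkit API):
# if removing the cited passage barely moves the model's confidence,
# the citation was decorative.

def citation_is_decorative(question, cited_passage, evidence, answer_confidence,
                           threshold=0.05):
    """`answer_confidence(question, evidence) -> float in [0, 1]` is an assumed helper."""
    conf_with = answer_confidence(question, evidence)
    scrubbed = [p for p in evidence if p != cited_passage]
    conf_without = answer_confidence(question, scrubbed)
    # A citation the model actually relied on should lose confidence when scrubbed.
    return abs(conf_with - conf_without) < threshold
```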
Lightweight client for the Thea Mini Reasoning API.
```python
from pythea import TheaClient

with TheaClient(base_url="https://...") as client:
    resp = client.unified_answer(
        question="What is 2+2?",
        backend="aoai-pool",
        m=6,
    )
    print(resp.get("answer"))
```

Model-agnostic permutation-mixture evaluation via Bernoulli first-token logprob probes.
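Before the library call below, a rough, library-independent illustration of what a Bernoulli first-token probe and a permutation mixture mean. Every name here (`probe_yes_no`, the token strings, the aggregation) is invented for the sketch and is not the pythea API:

```python
import math
import random

def bernoulli_p(first_token_logprobs):
    """Turn first-token logprobs for "Yes"/"No" into a single Bernoulli parameter p.
    `first_token_logprobs` is an assumed {token: logprob} mapping from one model call."""
    p_yes = math.exp(first_token_logprobs.get("Yes", float("-inf")))
    p_no = math.exp(first_token_logprobs.get("No", float("-inf")))
    total = p_yes + p_no
    # Renormalise over the two answer tokens; fall back to 0.5 if neither appears.
    return p_yes / total if total > 0 else 0.5

def permutation_mixture(parts, probe_yes_no, m=6, seed=0):
    """Average the probe probability over m random orderings of the evidence parts.
    `probe_yes_no(ordered_parts) -> {token: logprob}` is an assumed callable."""
    rng = random.Random(seed)
    ps = []
    for _ in range(m):
        order = list(parts)
        rng.shuffle(order)
        ps.append(bernoulli_p(probe_yes_no(order)))
    return sum(ps) / len(ps)  # mean probe probability across orderings
```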
```python
from pythea.offline import qmv

res = qmv.evaluate_permutation_family(
    probe=probe,
    parts=parts,
    cfg=qmv.PermutationEvalConfig(m=6, num_bands=2, seed=0),
)
print(res.q_bar, res.q_lo, res.js_bound)
```

```bash
git clone https://github.com/leochlon/pythea.git
cd pythea
pip install -e .
```

Extras:

```bash
pip install -e ".[dev]" # tests
pip install -e ".[offline]" # tiktoken for logit bias
pip install -e ".[vllm]" # local inference
```

```text
pythea/
├── strawberry/ # Procedural hallucination toolkit
│ ├── README.md
│ └── src/
├── src/pythea/ # Thea client + QMV probing
├── docs/ # Detailed documentation
├── examples/
├── tests/
└── benchmarks/
```
```bibtex
@article{hassanalabs2026procedural,
title={An Information-Theoretic and Causal Theory of Procedural Hallucinations},
author={{Hassana Labs}},
journal={arXiv preprint},
year={2026}
}
```

MIT; see LICENSE.md