A fast, pragmatic CLI & library for multi-language text analysis across .txt, .pdf, .docx, and .odt files.
- Unicode-aware tokenization
- Optional stopword filtering (custom list)
- Optional stemming (auto-detected or forced language)
- N‑gram counts
- Word frequencies
- Context stats (±N) & direct neighbors (±1)
- Collocation analysis with Pointwise Mutual Information (PMI) for all word pairs in the context window
- Named‑Entity extraction (simple capitalization heuristic)
- Parallel per‑file compute (safe, serialized writes)
- Combined (Map‑Reduce) mode to aggregate multiple files
- Deterministic, sorted exports (CSV/TSV/JSON/TXT)
- Robust I/O: errors are reported, never panic
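Tokenization is Unicode-aware. As a rough illustration of what that means in practice, here is a sketch using the unicode-segmentation crate as an assumed building block (not necessarily this library's internals):

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // UAX #29 word segmentation keeps accented and non-Latin words intact
    // and drops surrounding punctuation.
    let text = "Liberté, égalité, Grüße aus Köln!";
    let tokens: Vec<&str> = text.unicode_words().collect();
    println!("{tokens:?}"); // ["Liberté", "égalité", "Grüße", "aus", "Köln"]
}
```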
Install via one of:
- With cargo: cargo install text_analysis
- Download a binary from Releases
- Clone the repository and build from source
# Default TXT summary (one file)
text_analysis <path>
# CSV exports (multiple files: ngrams, wordfreq, context, neighbors, pmi, namedentities)
text_analysis <path> --export-format csv
# Combine all files into one corpus (Map-Reduce) and export as JSON
text_analysis <path> --combine --export-format json

Path can be a file or a directory (recursively scanned). Supported: .txt, .pdf, .docx, .odt.
text_analysis <path> [--stopwords <FILE>] [--ngram N] [--context N]
[--export-format {txt|csv|tsv|json}] [--entities-only]
[--combine]
[--stem] [--stem-lang <CODE>] [--stem-strict]
- --stopwords <FILE> – optional stopword list (one token per line).
- --ngram N – n‑gram size (default: 2).
- --context N – context window size for context & PMI (default: 5).
- --export-format – txt (default), csv, tsv, json.
- --entities-only – only export Named Entities (skips other tables).
- --combine – analyze all files as one corpus (Map‑Reduce) and write a single set of outputs.
- --stem – enable stemming with automatic language detection.
- --stem-lang <CODE> – force the stemming language (e.g., en, de, fr, es, it, pt, nl, ru, sv, fi, no, ro, hu, da, tr).
- --stem-strict – in auto mode, require a detectable & supported language:
  - Per‑file mode: files without a detectable/supported language are skipped (and reported).
  - Combined mode: the whole run aborts (prevents mixed stemming).
When the CLI finishes, it prints a concise summary to stdout. The order is tuned for usefulness:
- Top 20 N‑grams (count ↓, lexicographic tie‑break)
- Top 20 PMI pairs (count ↓, then PMI ↓, then words)
- Top 20 words (count ↓, lexicographic tie‑break)
This surfaces phrases and salient collocations before common function words.
- TXT (default): exactly one file per run:
  <stem>_<timestamp>_summary.txt
  Contains the three sorted blocks (Top 20 N‑grams → Top 20 PMI → Top 20 words).
- CSV/TSV/JSON: multiple files per run (one per analysis):
  <stem>_<timestamp>_ngrams.<ext>
  <stem>_<timestamp>_wordfreq.<ext>
  <stem>_<timestamp>_context.<ext>
  <stem>_<timestamp>_neighbors.<ext>
  <stem>_<timestamp>_pmi.<ext>
  <stem>_<timestamp>_namedentities.<ext>
| File suffix | Contents | Notes |
|---|---|---|
| _ngrams.<ext> | List of all observed n-grams and their counts | Sorted by count ↓, then lexicographically ↑ |
| _wordfreq.<ext> | Word frequency table (unigrams only) | Sorted by count ↓, then lexicographically ↑ |
| _context.<ext> | Directed co-occurrence counts for all tokens in a ±N window around each center token | Window size set by --context (default 5); includes all words except the center word |
| _neighbors.<ext> | Directed co-occurrence counts for immediate left/right neighbors (±1 distance) | Always exactly one left and one right position per center token |
| _pmi.<ext> | Word pairs within the context window with their counts, distances, and Pointwise Mutual Information | Pairs are unordered in storage; sorted by count ↓, then PMI ↓ in export |
| _namedentities.<ext> | Named entities detected via capitalization heuristic and their counts | Case-sensitive; ignores acronyms and common articles/determiners |
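The PMI column follows the standard definition PMI(w1, w2) = log2( p(w1, w2) / (p(w1) · p(w2)) ). A minimal sketch of that formula from raw counts (illustrative only; how the actual implementation normalizes window counts may differ):

```rust
/// Pointwise Mutual Information for a word pair, from raw counts.
/// `pair_count`: co-occurrences of (w1, w2) within the context window,
/// `c1`, `c2`: individual word counts, `total`: total token count.
fn pmi(pair_count: u64, c1: u64, c2: u64, total: u64) -> f64 {
    let p_xy = pair_count as f64 / total as f64;
    let p_x = c1 as f64 / total as f64;
    let p_y = c2 as f64 / total as f64;
    (p_xy / (p_x * p_y)).log2()
}

fn main() {
    // Toy numbers: a pair seen 3 times in a 1000-token corpus,
    // with individual word counts 5 and 4.
    println!("PMI = {:.3}", pmi(3, 5, 4, 1000));
}
```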
Sorting rules applied to all tabular exports:
- N‑grams & Wordfreq: by count desc, then key asc.
- Context & Neighbors (flattened): by count desc, then keys.
- PMI: by count desc, then PMI desc, then words.
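For orientation, these deterministic orderings correspond to comparators like the following (a sketch, not the library's code):

```rust
fn main() {
    // N-grams / word frequencies: count descending, then key ascending.
    let mut rows: Vec<(String, u64)> = vec![
        ("quick brown".into(), 3),
        ("brown fox".into(), 3),
        ("the quick".into(), 5),
    ];
    rows.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    // Ties on count are broken lexicographically: "brown fox" before "quick brown".
    println!("{rows:?}");

    // PMI pairs: count descending, then PMI descending, then the word pair.
    let mut pairs: Vec<(String, String, u64, f64)> = vec![
        ("lazy".into(), "dog".into(), 3, 2.1),
        ("brown".into(), "fox".into(), 3, 4.7),
    ];
    pairs.sort_by(|a, b| {
        b.2.cmp(&a.2)
            .then(b.3.partial_cmp(&a.3).unwrap_or(std::cmp::Ordering::Equal))
            .then_with(|| (&a.0, &a.1).cmp(&(&b.0, &b.1)))
    });
    println!("{pairs:?}");
}
```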
With --combine, all inputs are processed as one corpus and exported once with stem "combined":
combined_<timestamp>_wordfreq.<ext>, combined_<timestamp>_ngrams.<ext>, …
<stem> is collision‑safe: derived from the file name plus a short path hash. In per‑file mode each input gets its own stem; in combined mode the stem is literally combined.
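A sketch of how such a collision-safe stem could be derived (illustrative; the crate's actual hash function and format are not specified here):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::Path;

/// Derive a collision-safe stem: file name plus a short hash of the full path,
/// so two inputs named "notes.txt" in different directories don't overwrite
/// each other's exports.
fn output_stem(path: &Path) -> String {
    let name = path
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("input");
    let mut hasher = DefaultHasher::new();
    path.hash(&mut hasher);
    let short_hash = hasher.finish() & 0xFFFF_FFFF; // keep the suffix short
    format!("{name}_{short_hash:08x}")
}

fn main() {
    println!("{}", output_stem(Path::new("docs/a/notes.txt")));
    println!("{}", output_stem(Path::new("docs/b/notes.txt")));
}
```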
Add to Cargo.toml:
[dependencies]
text_analysis = "0.4.7"Basic example:
use std::collections::HashSet;
use text_analysis::*;
fn main() -> Result<(), String> {
let text = "The quick brown fox jumps over the lazy dog.";
let opts = AnalysisOptions {
ngram: 2,
context: 5,
export_format: ExportFormat::Json,
entities_only: false,
combine: false,
stem_mode: StemMode::Off,
stem_require_detected: false,
};
let stop = HashSet::new();
let result = analyze_text_with(text, &stop, &opts);
println!("Top words: {:?}", result.wordfreq);
Ok(())
}

Named‑Entity extraction uses a simple capitalization heuristic:
- Token starts with an uppercase letter
- Token is not all uppercase (filters acronyms)
- Filters very common determiners/articles across DE/EN/FR/ES/IT
Counts are case‑sensitive and computed on original tokens (not stemmed).
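A sketch of that heuristic (illustrative; the real filter covers more determiners/articles across DE/EN/FR/ES/IT and fuller Unicode case handling):

```rust
/// Capitalization-based Named-Entity heuristic:
/// first letter uppercase, not all-caps, and not a common determiner/article.
fn looks_like_entity(token: &str) -> bool {
    const COMMON: &[&str] = &["The", "Der", "Die", "Das", "Le", "La", "El", "Il"];
    let first_upper = token.chars().next().map_or(false, |c| c.is_uppercase());
    let all_upper = token.chars().all(|c| !c.is_alphabetic() || c.is_uppercase());
    first_upper && !all_upper && !COMMON.contains(&token)
}

fn main() {
    for t in ["Berlin", "NATO", "The", "fox"] {
        println!("{t}: {}", looks_like_entity(t)); // true, false, false, false
    }
}
```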
- StemMode::Off – no stemming
- StemMode::Auto – language detected via whatlang; stem if supported
- StemMode::Force(lang) – use a specific stemmer
stem_require_detected controls strictness in auto mode (see CLI).
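For orientation, auto-detection of this kind can be assembled from whatlang plus a Snowball stemmer such as rust_stemmers; a minimal sketch under that assumption (not necessarily this crate's internal code):

```rust
use rust_stemmers::{Algorithm, Stemmer};

/// Map a detected whatlang language to a stemming algorithm, if supported.
fn stemmer_for(lang: whatlang::Lang) -> Option<Stemmer> {
    let algo = match lang {
        whatlang::Lang::Eng => Algorithm::English,
        whatlang::Lang::Deu => Algorithm::German,
        whatlang::Lang::Fra => Algorithm::French,
        whatlang::Lang::Spa => Algorithm::Spanish,
        _ => return None, // unsupported: skip stemming (or skip/abort in strict mode)
    };
    Some(Stemmer::create(algo))
}

fn main() {
    let text = "Der schnelle braune Fuchs springt über den faulen Hund.";
    // Detect the dominant language, then stem token by token.
    if let Some(info) = whatlang::detect(text) {
        if let Some(stemmer) = stemmer_for(info.lang()) {
            for token in text.split_whitespace() {
                print!("{} ", stemmer.stem(&token.to_lowercase()));
            }
            println!();
        }
    }
}
```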
Uses pdf-extract. Files that fail to parse are listed in the warnings and don’t abort the run.
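A sketch of that error-tolerant pattern with pdf-extract (illustrative; the crate's own warning format will differ):

```rust
use std::path::Path;

/// Extract text from a PDF, turning parse failures into a warning instead of a panic.
fn read_pdf(path: &Path, warnings: &mut Vec<String>) -> Option<String> {
    match pdf_extract::extract_text(path) {
        Ok(text) => Some(text),
        Err(e) => {
            warnings.push(format!("skipping {}: {}", path.display(), e));
            None
        }
    }
}

fn main() {
    let mut warnings = Vec::new();
    if let Some(text) = read_pdf(Path::new("report.pdf"), &mut warnings) {
        println!("extracted {} characters", text.chars().count());
    }
    for w in &warnings {
        eprintln!("warning: {w}");
    }
}
```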
- DOCX: Parsed natively (pure Rust) by reading word/document.xml and extracting text content.
- ODT: Parsed natively (pure Rust) by reading content.xml and extracting text content.
Notes:
- Extraction focuses on plaintext content for analysis; complex formatting, headers/footers, and footnotes may be ignored.
- Files that fail to parse are listed in the warnings and don't abort the run.
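Both formats are ZIP containers with an XML payload, so an extraction along these lines is possible with the zip and quick-xml crates (a sketch under those assumptions, not necessarily the crates or code this project uses):

```rust
use std::fs::File;
use std::io::Read;

use quick_xml::events::Event;
use quick_xml::Reader;

/// Read one XML member from a ZIP container (e.g. "word/document.xml" for DOCX,
/// "content.xml" for ODT) and collect all text nodes.
fn extract_text(zip_path: &str, member: &str) -> Result<String, String> {
    let file = File::open(zip_path).map_err(|e| e.to_string())?;
    let mut archive = zip::ZipArchive::new(file).map_err(|e| e.to_string())?;
    let mut xml = String::new();
    archive
        .by_name(member)
        .map_err(|e| e.to_string())?
        .read_to_string(&mut xml)
        .map_err(|e| e.to_string())?;

    let mut reader = Reader::from_str(&xml);
    let mut text = String::new();
    loop {
        match reader.read_event() {
            Ok(Event::Text(t)) => {
                text.push_str(&t.unescape().map_err(|e| e.to_string())?);
                text.push(' ');
            }
            Ok(Event::Eof) => break,
            Err(e) => return Err(e.to_string()),
            _ => {} // ignore element structure; only plaintext is needed
        }
    }
    Ok(text)
}

fn main() {
    match extract_text("example.docx", "word/document.xml") {
        Ok(text) => println!("{} characters extracted", text.chars().count()),
        Err(e) => eprintln!("warning: {e}"),
    }
}
```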
- Use --export-format csv (or tsv/json) for downstream analysis in pandas/R/Excel.
- In noisy corpora, prefer --ngram 2 or --ngram 3 and check PMI first.
- For mixed‑language corpora, consider --stem-strict to avoid inconsistent stemming.
MIT
If you open exports in Excel/LibreOffice, cells that begin with =, +, -, or @ can be interpreted
as formulas. The recommended approach is:
- Use a proper CSV library (this project uses csv::Writer) for escaping.
- Prefix a ' for any text cell that starts with one of those characters.
This prevents spreadsheet software from executing user-provided content.
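A sketch of that prefixing rule with the csv crate (illustrative; the file name and exact escaping policy here are assumptions):

```rust
use csv::Writer;

/// Neutralize potential formula injection: prefix cells that a spreadsheet
/// would otherwise interpret as a formula.
fn escape_cell(cell: &str) -> String {
    match cell.chars().next() {
        Some('=') | Some('+') | Some('-') | Some('@') => format!("'{cell}"),
        _ => cell.to_string(),
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut wtr = Writer::from_path("wordfreq.csv")?;
    wtr.write_record(&["word", "count"])?;
    // "=cmd" would be rendered as a formula without the leading apostrophe.
    wtr.write_record(&[escape_cell("=cmd"), 3.to_string()])?;
    wtr.flush()?;
    Ok(())
}
```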
