中文 | English
This project implements a suffix array based on the doubling algorithm, supporting efficient string processing and suitable for UTF-8 text. The core part is written in C++ and wrapped as a Python extension module via Cython.
-
Time Complexity:
$\mathcal{O}(n\log n)$ -
Space Complexity:
$\mathcal{O}(n)$ (with a large constant, be cautious with large datasets) - Multilingual Support: Input must be UTF-8 encoded
size(): Returns the length of the textget_id(suf_rank): Given a suffix rank (1-indexed), returns its position in the text (0-indexed)get_suf(suf_rank): Given a suffix rank (1-indexed), returns the corresponding suffixget_rank(suf_id): Given the left boundary position of a suffix (0-indexed), returns its lexicographical rankget_count(pattern): Given a pattern string, returns the number of its occurrences in the textget_prob(prompt): Given a text prompt, returns the probability distribution of the next characterget_branch_entropy(prompt): Given a text prompt, returns its right branch entropyget_mutual_information(text): Given a text, returns its mutual information
It is recommended to use pip for local development installation:
pip install -e .Or manually build the extension module:
python setup.py build_ext --inplaceThe generated library file (e.g., SuffixArray.*.so) can be placed in your project directory and imported in Python code.
test.ipynb provides simple test cases. If you encounter unexpected outputs, feel free to open an issue.
- Currently, a linear-time construction algorithm is not used
- The space constant is large, not suitable for ultra-large datasets
- Test cases are limited, mainly for personal study and experimentation
This project is licensed under the MIT License. Free to use and modify.