A CLI tool to crawl developer documentation websites and save each page as a Markdown file.
This tool uses crawl4ai to perform deep crawling and extract content suitable for ingestion into RAG pipelines or direct use with LLMs.
- Crawls websites starting from a given URL.
- Uses crawl4ai's deep crawling (`BFSDeepCrawlStrategy` by default).
- Stays within the original domain (does not follow external links).
- Saves the markdown content of each successfully crawled page.
- Organizes output into a subdirectory named after the crawled domain (e.g., `output_dir/docs_example_com/`).
- Attempts to preserve the URL path structure within the domain subdirectory (see the sketch after this list).
- Offers a streaming mode (`--stream`, enabled by default) to process pages as they arrive.
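As a rough sketch of the resulting layout (the page filenames below are hypothetical; actual names depend on the site's URL structure and how the crawler maps paths to files):

```bash
devdocs-crawler https://docs.example.com/guide/
# Might produce something like:
#   devdocs_crawler_output/
#   └── docs_example_com/
#       └── guide/
#           ├── index.md        # hypothetical filename
#           └── quickstart.md   # hypothetical filename
```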
Using pipx is recommended as it installs the tool and its dependencies in an isolated environment, preventing conflicts with other Python projects.
```bash
# Ensure you have Python 3.12+ and pipx installed (pip install pipx)
pipx install devdocs-crawler

# To upgrade later:
pipx upgrade devdocs-crawler
```

You can also install using pip directly (ideally within a virtual environment):

```bash
# Ensure you have Python 3.12+ installed
pip install devdocs-crawler
```

Once installed, run the tool against the documentation site you want to crawl:

```bash
# Basic usage (crawl depth 1, stream enabled by default)
# Saves to ./devdocs_crawler_output/<domain_name>/
# Example: Saves to ./devdocs_crawler_output/docs_python_org/
devdocs-crawler https://docs.python.org/3/
# Specify a different base output directory
# Example: Saves to ./python_docs/docs_python_org/
devdocs-crawler https://docs.python.org/3/ -o ./python_docs
# Example: Crawl Neo4j GDS docs (depth 2)
# Example: Saves to ./devdocs_crawler_output/neo4j_com/
devdocs-crawler https://neo4j.com/docs/graph-data-science/current/ -d 2
# Example: Disable streaming
devdocs-crawler https://docs.example.com --no-stream
```

Options:
- `start_url`: (Required) The starting URL for the crawl (must include a scheme like `https://`).
- `-o, --output DIRECTORY`: Base directory to save crawl-specific subdirectories (default: `./devdocs_crawler_output`).
- `-d, --depth INTEGER`: Crawling depth beyond the start URL (0 = start URL only, 1 = start URL + linked pages, etc.) (default: 1).
- `--max-pages INTEGER`: Maximum total number of pages to crawl (default: no limit).
- `--stream / --no-stream`: Streaming mode processes pages as they arrive. Enabled by default. Use `--no-stream` to disable it and process all pages after the crawl finishes.
- `-v, --verbose`: Increase logging verbosity (`-v` for INFO, `-vv` for DEBUG). Default is WARNING.
- `--version`: Show the package version and exit.
- `-h, --help`: Show the help message and exit.
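For example, several of these options can be combined in one invocation (the URL and output directory below are just placeholders):

```bash
# Crawl two levels deep, stop after 50 pages total, log progress at INFO level,
# and write output under ./example_docs/docs_example_com/
devdocs-crawler https://docs.example.com/ -d 2 --max-pages 50 -v -o ./example_docs
```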
To set up a development environment:

- Clone the repository:

  ```bash
  git clone https://github.com/youssef-tharwat/devdocs-crawler
  ```

  (Replace with your fork if contributing.)
- Navigate to the project directory:

  ```bash
  cd devdocs-crawler
  ```

- Install `uv`: If you don't have it, install `uv` (e.g., `pip install uv`, or see the uv installation docs).
- Create the environment and install dependencies: Use `uv` to create an environment and install dependencies (including dev dependencies). Requires Python 3.12+.

  ```bash
  uv venv        # Creates .venv
  uv sync --dev  # Syncs based on pyproject.toml
  ```

  (Alternatively, if you prefer a manual venv: `python3.12 -m venv .venv`, `source .venv/bin/activate`, then `uv pip install -e .[dev]`.)
- Activate the environment:
  - macOS/Linux: `source .venv/bin/activate`
  - Windows: `.venv\Scripts\activate`
Now you can run the tool using `devdocs-crawler` from within the activated environment.
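A quick way to check that the entry point is wired up is something like:

```bash
# Should print the installed package version
devdocs-crawler --version
```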
You can run linters and formatters:
```bash
ruff check .
ruff format .
```

And run tests (if/when tests are added):

```bash
pytest
```

To build and publish a release:

- Ensure your `pyproject.toml` has the correct version number and author details.
- Build the distributions:

  ```bash
  uv build
  ```

  This creates wheel and source distributions in the `dist/` directory.
- Publish to PyPI (requires a PyPI account and an API token configured with `uv`):

  ```bash
  uv publish
  ```

  You can also publish to TestPyPI using `uv publish --repository testpypi`. See `uv publish --help` for more options, including providing tokens via environment variables or arguments.
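For instance, a PyPI token can be supplied through the environment instead of interactively; this sketch assumes uv's `UV_PUBLISH_TOKEN` variable (check `uv publish --help` for the exact flags and variables your uv version supports):

```bash
# Placeholder token value; never commit a real token
export UV_PUBLISH_TOKEN="pypi-...placeholder..."
uv publish
```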
Contributions are welcome! Please see the CONTRIBUTING.md file for guidelines (if one exists).
This project is licensed under the MIT License - see the LICENSE file for details.