DevDoc Crawler

A CLI tool to crawl developer documentation websites and save each page as a Markdown file.

This tool uses crawl4ai to perform deep crawling and extract content suitable for ingestion into RAG pipelines or direct use with LLMs.

Features

  • Crawls websites starting from a given URL.
  • Uses crawl4ai's deep crawling (BFSDeepCrawlStrategy by default; see the sketch after this list).
  • Stays within the original domain (does not follow external links).
  • Saves the markdown content of each successfully crawled page.
  • Organizes output into a subdirectory named after the crawled domain (e.g., output_dir/docs_example_com/).
  • Attempts to preserve the URL path structure within the domain subdirectory.
  • Offers a streaming mode (--stream, enabled by default) to process pages as they arrive.
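
For orientation, here is a minimal sketch of how such a crawl can be driven with crawl4ai directly, using the library's documented deep-crawl API. This is an illustration, not this tool's actual source code, and details may differ across crawl4ai versions:

import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=1,             # follow links one level beyond the start URL
            include_external=False,  # stay within the original domain
        ),
        stream=True,  # yield each result as soon as its page is crawled
    )
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.example.com/", config=config):
            if result.success:
                print(result.url)  # result.markdown holds the page's extracted Markdown

asyncio.run(main())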

Installation (Recommended: pipx)

Using pipx is recommended as it installs the tool and its dependencies in an isolated environment, preventing conflicts with other Python projects.

# Ensure you have Python 3.12+ and pipx installed (pip install pipx)

pipx install devdocs-crawler

# To upgrade later:
pipx upgrade devdocs-crawler

Alternative Installation (pip)

You can also install using pip directly (ideally within a virtual environment):

# Ensure you have Python 3.12+ installed

pip install devdocs-crawler

Usage

# Basic usage (crawl depth 1, stream enabled by default)
# Saves to ./devdocs_crawler_output/<domain_name>/
# Here: ./devdocs_crawler_output/docs_python_org/
devdocs-crawler https://docs.python.org/3/

# Specify a different base output directory
# Example: Saves to ./python_docs/docs_python_org/
devdocs-crawler https://docs.python.org/3/ -o ./python_docs

# Example: Crawl the Neo4j GDS docs (depth 2)
# Saves to ./devdocs_crawler_output/neo4j_com/
devdocs-crawler https://neo4j.com/docs/graph-data-science/current/ -d 2

# Example: Disable streaming
devdocs-crawler https://docs.example.com --no-stream
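
# The options documented below combine freely, e.g. cap the crawl at
# 50 pages and raise logging to INFO
devdocs-crawler https://docs.example.com --max-pages 50 -v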

Options:

  • start_url: (Required) The starting URL for the crawl (must include a scheme like https://).
  • -o, --output DIRECTORY: Base directory to save crawl-specific subdirectories (default: ./devdocs_crawler_output; see the layout sketch after this list).
  • -d, --depth INTEGER: Crawling depth beyond the start URL (0 = start URL only, 1 = start URL + linked pages, etc.) (default: 1).
  • --max-pages INTEGER: Maximum total number of pages to crawl (default: no limit).
  • --stream / --no-stream: Stream results, processing each page as it arrives (enabled by default). Use --no-stream to process all pages only after the crawl finishes.
  • -v, --verbose: Increase logging verbosity (-v for INFO, -vv for DEBUG). Default is WARNING.
  • --version: Show the package version and exit.
  • -h, --help: Show the help message and exit.
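
To make the output layout concrete, here is a minimal Python sketch of the URL-to-path mapping implied by the examples above. url_to_output_path is a hypothetical helper written for illustration only; the package does not expose it, and the tool's actual sanitization rules may differ in detail.

from pathlib import Path
from urllib.parse import urlparse

def url_to_output_path(url: str, base_dir: str = "devdocs_crawler_output") -> Path:
    """Hypothetical illustration of the output layout described above."""
    parsed = urlparse(url)
    # docs.python.org -> docs_python_org (the domain subdirectory)
    domain_dir = parsed.netloc.replace(".", "_").replace(":", "_")
    # Preserve the URL path structure inside the domain subdirectory
    rel = parsed.path.strip("/") or "index"
    return Path(base_dir) / domain_dir / f"{rel}.md"

print(url_to_output_path("https://docs.python.org/3/library/asyncio.html"))
# -> devdocs_crawler_output/docs_python_org/3/library/asyncio.html.md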

Development

  1. Clone the repository: git clone https://github.com/youssef-tharwat/devdocs-crawler (Replace with your fork if contributing)
  2. Navigate to the project directory: cd devdocs-crawler
  3. Install uv: If you don't already have it, install it (e.g., pip install uv, or see the uv installation docs).
  4. Create environment & Install: Use uv to create an environment and install dependencies (including dev dependencies). Requires Python 3.12+.
    uv venv # Creates .venv
    uv sync --dev # Syncs based on pyproject.toml
    (Alternatively, if you prefer a manual venv: python3.12 -m venv .venv, source .venv/bin/activate, then uv pip install -e ".[dev]")
  5. Activate the environment:
    • macOS/Linux: source .venv/bin/activate
    • Windows: .venv\Scripts\activate

Now you can run the tool using devdocs-crawler from within the activated environment.

You can run linters and formatters:

ruff check .
ruff format .

And run tests (if/when tests are added):

pytest

Building and Publishing (using uv)

  1. Ensure your pyproject.toml has the correct version number and author details.
  2. Build the distributions:
    uv build
    This creates wheel and source distributions in the dist/ directory.
  3. Publish to PyPI (requires a PyPI account and an API token configured with uv):
    uv publish
    You can also publish to TestPyPI by passing its upload URL, e.g. uv publish --publish-url https://test.pypi.org/legacy/. See uv publish --help for more options, including providing tokens via environment variables or arguments.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md (if present) for guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.
