vmcrawl

Overview

vmcrawl is a Mastodon-focused version reporting crawler. It is written in Python, with a PostgreSQL database backend. It performs periodic polling of known Mastodon instances to track version information, user counts, and security patch status.

Requirements

  • Python 3.13 or higher
  • PostgreSQL database
  • uv (Python package and project manager)

Installation

Quick Start (Development)

For development or testing, you can quickly set up vmcrawl in the current directory:

git clone https://github.com/vmstio/vmcrawl.git
cd vmcrawl
uv sync
./vmcrawl.sh

Production Installation

For production deployments, follow these steps to install vmcrawl as a system service:

1. Install Application

# Clone application files
git clone https://github.com/vmstio/vmcrawl.git /opt/vmcrawl

2. Create System User

# Create vmcrawl user and set ownership
useradd -r -s /bin/bash -d /opt/vmcrawl vmcrawl
chown -R vmcrawl:vmcrawl /opt/vmcrawl

3. Set Up Virtual Environment

# Switch to vmcrawl user
sudo -u vmcrawl -i

# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Exit and log back in to refresh the PATH
exit
sudo -u vmcrawl -i

# Create virtual environment and install dependencies
cd /opt/vmcrawl
uv sync

# Exit vmcrawl user
exit

4. Configure Environment Variables

sudo -u vmcrawl vim /opt/vmcrawl/.env

Add your configuration:

VMCRAWL_POSTGRES_DATA="dbname"
VMCRAWL_POSTGRES_USER="username"
VMCRAWL_POSTGRES_PASS="password"
VMCRAWL_POSTGRES_HOST="localhost"
VMCRAWL_POSTGRES_PORT="5432"

On your PostgreSQL server, execute the contents of creation.sql to create the required tables.
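
One way to do this is with the psql client, assuming it is installed and using the same credentials as your .env file:

# Load the schema into the configured database (adjust host, user, and dbname to match .env)
psql -h localhost -p 5432 -U username -d dbname -f /opt/vmcrawl/creation.sql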

5. Install Service Files

# Make the shell scripts executable
chmod +x /opt/vmcrawl/vmcrawl.sh
chmod +x /opt/vmcrawl/vmfetch.sh

# Copy service files to systemd
cp /opt/vmcrawl/vmcrawl.service /etc/systemd/system/
cp /opt/vmcrawl/vmfetch.service /etc/systemd/system/
cp /opt/vmcrawl/vmfetch.timer /etc/systemd/system/

# Reload systemd
systemctl daemon-reload

6. Enable and Start Services

# Enable crawler service to start on boot
systemctl enable vmcrawl.service

# Start the crawler service
systemctl start vmcrawl.service

# Check crawler status
systemctl status vmcrawl.service

# Enable and start the vmfetch timer (runs hourly)
systemctl enable vmfetch.timer
systemctl start vmfetch.timer

# Check vmfetch timer status
systemctl status vmfetch.timer
systemctl list-timers vmfetch.timer

Docker Installation

You can also run vmcrawl using Docker:

docker build -t vmcrawl .
docker run -d --name vmcrawl --env-file .env vmcrawl
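
Note that with the sample .env shown later, VMCRAWL_POSTGRES_HOST="localhost" resolves to the container itself, not the Docker host. If PostgreSQL runs on the host, one possible workaround (an assumption about your network setup, not a shipped configuration) is to map host.docker.internal:

# Map the host gateway so the container can reach PostgreSQL on the Docker host;
# set VMCRAWL_POSTGRES_HOST="host.docker.internal" in .env to match
docker run -d --name vmcrawl --env-file .env \
  --add-host=host.docker.internal:host-gateway \
  vmcrawl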

Scripts

The project includes four main scripts:

  • crawler.py: Main crawling engine that processes domains, collects version/user data, and generates statistics
  • fetch.py: Fetches new domains from federated instance peer lists
  • nightly.py: Manages nightly/development version tracking in the database
  • dni.py: Fetches and manages the IFTAS DNI (Do Not Interact) list of blocked domains

Automated Tasks

Automated Fetching:

The vmfetch.timer systemd timer automatically runs fetch.py --random every hour to continuously discover new instances from random servers in your database. This ensures your instance list stays up-to-date without manual intervention. The timer starts one hour after system boot and runs hourly thereafter.
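
As an illustrative sketch only (the repository ships its own vmfetch.timer; consult that file for the authoritative definition), a timer with this behavior looks roughly like:

[Unit]
Description=Run vmfetch hourly

[Timer]
# First run one hour after boot, then one hour after each activation
OnBootSec=1h
OnUnitActiveSec=1h
Unit=vmfetch.service

[Install]
WantedBy=timers.target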

Statistics Generation:

Statistics are automatically generated and recorded by the main crawler (crawler.py) during its crawling operations. Historical statistics tracking is integrated into the crawling workflow, eliminating the need for a separate statistics service.

Usage

Fetching Domains

To start using vmcrawl, you will need to populate your database with instances to crawl. You can fetch a list of fediverse instances from an existing Mastodon instance:

Native:

./vmfetch.sh

Docker:

docker exec vmcrawl ./vmfetch.sh

The first time this is launched, it defaults to polling vmst.io for instances to crawl. To override this, you can target a specific instance:

Native:

./vmfetch.sh --target example.social

Docker:

docker exec vmcrawl ./vmfetch.sh --target example.social

Once you have established a set of known good Mastodon instances, you can use them to fetch new federated instances:

Native:

./vmfetch.sh

Docker:

docker exec vmcrawl ./vmfetch.sh

This will scan the top 10 instances in your database by total users.

You can change the limit, or offset the domain list from the top:

Native:

./vmfetch.sh --limit 100 --offset 50

Docker:

docker exec vmcrawl ./vmfetch.sh --limit 100 --offset 50

You can use --limit and --offset together or individually, but neither option can be combined with --target.

Unless you specifically target a server, fetch.py will only attempt to fetch from instances with over 100 active users. If a fetch from a server fails, that server is added to the no_peers table and is not queried for new instances in the future.
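
To see how many servers have been excluded, you can query the table directly; this is a hypothetical inspection query that assumes only the no_peers table name mentioned above:

# Count servers excluded from future peer fetches (credentials per your .env)
psql -h localhost -U username -d dbname -c "SELECT count(*) FROM no_peers;"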

You can also select a random sampling of servers to fetch from, instead of going by user count:

Native:

./vmfetch.sh --random

Docker:

docker exec vmcrawl ./vmfetch.sh --random

You can combine --random with --limit, but not with --target or --offset.
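
For example, to fetch peers from 25 randomly selected servers (the count here is illustrative):

./vmfetch.sh --random --limit 25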

Crawling Instances

After you have a list of instances to crawl, run the following command:

Native:

./vmcrawl.sh

Docker:

docker exec -it vmcrawl ./vmcrawl.sh

Selecting 0 from the interactive menu will begin processing your recently fetched domains.

Menu Options

You can customize the crawling process with the following options:

Process new domains:

  • 0 Recently Fetched

Change process direction:

  • 1 Standard Alphabetical List
  • 2 Reverse Alphabetical List
  • 3 Random Order (this is the default option for headless runs)

Retry fatal errors:

  • 6 Other Platforms (non-Mastodon instances)
  • 7 Rejected (HTTP 410/418 errors)
  • 8 Failed (NXDOMAIN/emoji domains)
  • 9 Crawling Prohibited (robots.txt blocks)

Retry connection errors:

  • 10 SSL (certificate errors)
  • 11 HTTP (general HTTP errors)
  • 12 TCP (timeouts, connection issues)
  • 13 MAX (maximum redirects exceeded)
  • 14 DNS (name resolution failures)

Retry HTTP errors:

  • 20 2xx status codes
  • 21 3xx status codes
  • 22 4xx status codes
  • 23 5xx status codes

Retry specific errors:

  • 30 JSON parsing errors
  • 31 TXT/plain text response errors
  • 32 API errors

Retry known instances:

  • 40 Unpatched (instances not running latest patches)
  • 41 Main (instances on development/main branch)
  • 42 Development (instances running alpha, beta, or rc versions)
  • 43 Inactive (0 active monthly users)
  • 44 All Good (all known instances)
  • 45 Misreporting (instances with invalid version data)

Retry general errors:

  • 50 Domains with >14 Errors
  • 51 Domains with 7-14 Errors

Headless Crawling

By default, when the script is run headless, it performs a random crawl of the instances in the database.

To limit what is crawled in headless mode, use the following arguments:

  • --new will function like menu option 0, processing only recently fetched domains (see the example below).
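
For example:

# Headless crawl restricted to recently fetched domains
./vmcrawl.sh --new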

Targeting

You can target a specific domain to fetch or crawl with the target option:

Native:

./vmcrawl.sh --target vmst.io

Docker:

docker exec -it vmcrawl ./vmcrawl.sh --target vmst.io

You can include multiple domains in a comma-separated list:

Native:

./vmcrawl.sh --target mas.to,infosec.exchange

Docker:

docker exec -it vmcrawl ./vmcrawl.sh --target mas.to,infosec.exchange

You can also process multiple domains using an external file, which contains each domain on a new line:

Native:

./vmcrawl.sh --file ~/domains.txt

Docker:

docker exec -it vmcrawl ./vmcrawl.sh --file /opt/vmcrawl/domains.txt
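
The file is plain text with one domain per line; for example, a domains.txt might look like this (domains purely illustrative):

mastodon.social
fosstodon.org
example.social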

Nightly Version Management

The nightly.py script manages tracking of development/nightly versions:

Native:

uv run nightly.py

Docker:

docker exec -it vmcrawl uv run nightly.py

This displays current nightly version entries and allows you to add new versions as they are released. Nightly versions are used to identify instances running pre-release software (alpha, beta, rc versions).

DNI List Management

The dni.py script fetches and manages the IFTAS DNI (Do Not Interact) list:

Fetch and import DNI list:

Native:

uv run dni.py

Docker:

docker exec vmcrawl uv run dni.py

List all DNI domains:

Native:

uv run dni.py --list

Docker:

docker exec vmcrawl uv run dni.py --list

Count DNI domains:

Native:

uv run dni.py --count

Docker:

docker exec vmcrawl uv run dni.py --count

Use custom CSV URL:

Native:

uv run dni.py --url https://example.com/custom-dni-list.csv

Docker:

docker exec vmcrawl uv run dni.py --url https://example.com/custom-dni-list.csv

The DNI list is sourced from IFTAS (Independent Federated Trust & Safety) and contains domains that have been identified for various trust and safety concerns. All domains imported from the IFTAS list are tagged with the comment "iftas" in the database.

Configuration

Backport Branches

Set the environment variable VMCRAWL_BACKPORTS to a comma-separated list of the branches you wish to maintain backport information for.

Example:

VMCRAWL_BACKPORTS="4.5,4.4,4.3,4.2"

Service Management

For production installations using systemd:

View Logs

# Follow crawler logs in real-time
journalctl -u vmcrawl.service -f

# View recent crawler logs
journalctl -u vmcrawl.service -n 100

# View crawler logs since boot
journalctl -u vmcrawl.service -b

# Follow vmfetch logs in real-time
journalctl -u vmfetch.service -f

# View recent vmfetch logs
journalctl -u vmfetch.service -n 100

Control Services

Crawler Service:

# Stop service
systemctl stop vmcrawl.service

# Restart service
systemctl restart vmcrawl.service

# Disable service
systemctl disable vmcrawl.service

Fetch Timer:

# Stop timer
systemctl stop vmfetch.timer

# Restart timer
systemctl restart vmfetch.timer

# Disable timer
systemctl disable vmfetch.timer

# Manually trigger a fetch
systemctl start vmfetch.service

# Check when the next fetch will run
systemctl list-timers vmfetch.timer

Troubleshooting

Service fails to start

  1. Check logs: journalctl -u vmcrawl.service -n 50
  2. Verify permissions: ls -la /opt/vmcrawl
  3. Test script manually: sudo -u vmcrawl /opt/vmcrawl/vmcrawl.sh

Permission errors

# Fix ownership
chown -R vmcrawl:vmcrawl /opt/vmcrawl

# Fix script permissions
chmod +x /opt/vmcrawl/vmcrawl.sh

Contributing

We welcome contributions! Please read our contributing guidelines for more details.

License

This project is licensed under the MIT License. See the LICENSE file for more information.

Contact

For any questions or feedback, please open an issue on GitHub.
