vmcrawl is a Mastodon-focused version reporting crawler.
It is written in Python, with a PostgreSQL database backend.
It periodically polls known Mastodon instances to track version information, user counts, and security patch status.
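The exact collection logic lives in crawler.py, but the data it tracks is the kind of information Mastodon already publishes through its public instance API. As an illustration only (not necessarily the endpoint vmcrawl calls internally), you can see a server's reported version and usage with curl and jq:

```
curl -s https://vmst.io/api/v2/instance | jq '{domain, version, usage}'
```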
- Python 3.13 or higher
- PostgreSQL database
- uv (Python package manager)
For development or testing, you can quickly set up vmcrawl in the current directory:
```
git clone https://github.com/vmstio/vmcrawl.git
cd vmcrawl
uv sync
./vmcrawl.sh
```
For production deployments, follow these steps to install vmcrawl as a system service:
```
# Clone application files
git clone https://github.com/vmstio/vmcrawl.git /opt/vmcrawl
```
```
# Create vmcrawl user and set ownership
useradd -r -s /bin/bash -d /opt/vmcrawl vmcrawl
chown -R vmcrawl:vmcrawl /opt/vmcrawl
```
```
# Switch to vmcrawl user
sudo -u vmcrawl -i

# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Exit and log back in to refresh the PATH
exit
sudo -u vmcrawl -i

# Create virtual environment and install dependencies
cd /opt/vmcrawl
uv sync

# Exit vmcrawl user
exit
```
```
sudo -u vmcrawl vim /opt/vmcrawl/.env
```
Add your configuration:
```
VMCRAWL_POSTGRES_DATA="dbname"
VMCRAWL_POSTGRES_USER="username"
VMCRAWL_POSTGRES_PASS="password"
VMCRAWL_POSTGRES_HOST="localhost"
VMCRAWL_POSTGRES_PORT="5432"
```
On your PostgreSQL server, execute the contents of creation.sql to create the required tables.
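For example, assuming the database and role from the .env above already exist and the repository was cloned to /opt/vmcrawl, the schema can be loaded with psql:

```
psql -h localhost -p 5432 -U username -d dbname -f /opt/vmcrawl/creation.sql
```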
```
# Make the shell scripts executable
chmod +x /opt/vmcrawl/vmcrawl.sh
chmod +x /opt/vmcrawl/vmfetch.sh

# Copy service files to systemd
cp /opt/vmcrawl/vmcrawl.service /etc/systemd/system/
cp /opt/vmcrawl/vmfetch.service /etc/systemd/system/
cp /opt/vmcrawl/vmfetch.timer /etc/systemd/system/

# Reload systemd
systemctl daemon-reload
```
```
# Enable crawler service to start on boot
systemctl enable vmcrawl.service

# Start the crawler service
systemctl start vmcrawl.service

# Check crawler status
systemctl status vmcrawl.service

# Enable and start the vmfetch timer (runs hourly)
systemctl enable vmfetch.timer
systemctl start vmfetch.timer

# Check vmfetch timer status
systemctl status vmfetch.timer
systemctl list-timers vmfetch.timer
```
You can also run vmcrawl using Docker:
```
docker build -t vmcrawl .
docker run -d --name vmcrawl --env-file .env vmcrawl
```
The project includes four main scripts:
| Script | Purpose |
|---|---|
| `crawler.py` | Main crawling engine that processes domains, collects version/user data, and generates statistics |
| `fetch.py` | Fetches new domains from federated instance peer lists |
| `nightly.py` | Manages nightly/development version tracking in the database |
| `dni.py` | Fetches and manages IFTAS DNI (Do Not Interact) list of blocked domains |
Automated Fetching:
The vmfetch.timer systemd timer automatically runs fetch.py --random every hour to continuously discover new instances from random servers in your database. This ensures your instance list stays up-to-date without manual intervention. The timer starts one hour after system boot and runs hourly thereafter.
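The repository ships this timer unit, so you should not need to write it yourself. For reference, a timer matching the behavior described above (first run one hour after boot, then hourly) would look roughly like the sketch below; the actual vmfetch.timer in the repository may differ:

```
[Unit]
Description=Run vmfetch hourly

[Timer]
OnBootSec=1h
OnUnitActiveSec=1h
Unit=vmfetch.service

[Install]
WantedBy=timers.target
```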
Statistics Generation:
Statistics are automatically generated and recorded by the main crawler (crawler.py) during its crawling operations. Historical statistics tracking is integrated into the crawling workflow, eliminating the need for a separate statistics service.
To start using vmcrawl you will need to populate your database with instances to crawl. You can fetch a list of fediverse instances from an existing Mastodon instance:
Native:
```
./vmfetch.sh
```
Docker:
```
docker exec vmcrawl ./vmfetch.sh
```
The first time this is launched it will default to polling vmst.io for instances to crawl.
If you wish to override this, you can target a specific instance:
Native:
```
./vmfetch.sh --target example.social
```
Docker:
```
docker exec vmcrawl ./vmfetch.sh --target example.social
```
Once you have established a set of known good Mastodon instances, you can use them to fetch new federated instances:
Native:
```
./vmfetch.sh
```
Docker:
```
docker exec vmcrawl ./vmfetch.sh
```
This will scan the top 10 instances in your database by total users.
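This style of discovery relies on the public peers list that Mastodon exposes. Whether fetch.py calls exactly this URL is not documented here, but you can inspect a server's peer list yourself with curl and jq:

```
# Count peers, then show the first ten
curl -s https://vmst.io/api/v1/instance/peers | jq 'length'
curl -s https://vmst.io/api/v1/instance/peers | jq '.[:10]'
```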
You can change the limits or offset the domain list from the top:
Native:
```
./vmfetch.sh --limit 100 --offset 50
```
Docker:
```
docker exec vmcrawl ./vmfetch.sh --limit 100 --offset 50
```
You can use limit and offset together, or individually, but neither option can be combined with the target argument.
Unless you specifically target a server, fetch.py will only attempt to fetch from instances with over 100 active users.
If a server fails to fetch, it will be added to a no_peers table, and vmcrawl will not attempt to fetch new instances from it in the future.
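If you want to see which servers have been excluded this way, you can query the table directly. This assumes the no_peers table name mentioned above and the credentials from your .env; the column layout may vary:

```
psql -h localhost -U username -d dbname -c "SELECT * FROM no_peers LIMIT 10;"
```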
You can also select a random sampling of servers to fetch from, instead of going by user count:
Native:
```
./vmfetch.sh --random
```
Docker:
```
docker exec vmcrawl ./vmfetch.sh --random
```
You can combine random with the limit command, but not with target or offset.
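For example, to poll a random sample of servers while capping how many are selected (assuming --limit bounds the sample size when combined with --random):

```
./vmfetch.sh --random --limit 20
```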
After you have a list of instances to crawl, run the following command:
Native:
```
./vmcrawl.sh
```
Docker:
```
docker exec -it vmcrawl ./vmcrawl.sh
```
Selecting 0 from the interactive menu will begin to process all of your fetched domains.
You can customize the crawling process with the following options:
Process new domains:

- 0: Recently Fetched

Change process direction:

- 1: Standard Alphabetical List
- 2: Reverse Alphabetical List
- 3: Random Order (this is the default option for headless runs)

Retry fatal errors:

- 6: Other Platforms (non-Mastodon instances)
- 7: Rejected (HTTP 410/418 errors)
- 8: Failed (NXDOMAIN/emoji domains)
- 9: Crawling Prohibited (robots.txt blocks)

Retry connection errors:

- 10: SSL (certificate errors)
- 11: HTTP (general HTTP errors)
- 12: TCP (timeouts, connection issues)
- 13: MAX (maximum redirects exceeded)
- 14: DNS (name resolution failures)

Retry HTTP errors:

- 20: 2xx status codes
- 21: 3xx status codes
- 22: 4xx status codes
- 23: 5xx status codes

Retry specific errors:

- 30: JSON parsing errors
- 31: TXT/plain text response errors
- 32: API errors

Retry known instances:

- 40: Unpatched (instances not running latest patches)
- 41: Main (instances on development/main branch)
- 42: Development (instances running alpha, beta, or rc versions)
- 43: Inactive (0 active monthly users)
- 44: All Good (all known instances)
- 45: Misreporting (instances with invalid version data)

Retry general errors:

- 50: Domains with >14 Errors
- 51: Domains with 7-14 Errors
By default, when the script is run headless it will do a random crawl of instances in the database.
To limit what is crawled in headless mode, use the following arguments:
- `--new` will function like option 0, and only process new domains recently fetched.
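Put together, when vmcrawl.sh is invoked without a terminal (for example from the systemd service or another scheduler), the two behaviors described above correspond to:

```
# No arguments: random crawl of instances in the database
./vmcrawl.sh

# Only process recently fetched domains
./vmcrawl.sh --new
```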
You can target a specific domain to fetch or crawl with the target option:
Native:
```
./vmcrawl.sh --target vmst.io
```
Docker:
```
docker exec -it vmcrawl ./vmcrawl.sh --target vmst.io
```
You can include multiple domains in a comma-separated list:
Native:
```
./vmcrawl.sh --target mas.to,infosec.exchange
```
Docker:
```
docker exec -it vmcrawl ./vmcrawl.sh --target mas.to,infosec.exchange
```
You can also process multiple domains using an external file, which contains each domain on a new line:
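The file passed to --file is just a plain list, one domain per line, for example:

```
vmst.io
mas.to
infosec.exchange
```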
Native:
```
./vmcrawl.sh --file ~/domains.txt
```
Docker:
```
docker exec -it vmcrawl ./vmcrawl.sh --file /opt/vmcrawl/domains.txt
```
The nightly.py script manages tracking of development/nightly versions:
Native:
```
uv run nightly.py
```
Docker:
```
docker exec -it vmcrawl uv run nightly.py
```
This displays current nightly version entries and allows you to add new versions as they are released. Nightly versions are used to identify instances running pre-release software (alpha, beta, rc versions).
The dni.py script fetches and manages the IFTAS DNI (Do Not Interact) list:
Fetch and import DNI list:
Native:
```
uv run dni.py
```
Docker:
```
docker exec vmcrawl uv run dni.py
```
List all DNI domains:

Native:
```
uv run dni.py --list
```
Docker:
```
docker exec vmcrawl uv run dni.py --list
```
Count DNI domains:

Native:
```
uv run dni.py --count
```
Docker:
```
docker exec vmcrawl uv run dni.py --count
```
Use custom CSV URL:

Native:
```
uv run dni.py --url https://example.com/custom-dni-list.csv
```
Docker:
```
docker exec vmcrawl uv run dni.py --url https://example.com/custom-dni-list.csv
```
The DNI list is sourced from IFTAS (Independent Federated Trust & Safety) and contains domains that have been identified for various trust and safety concerns. All domains imported from the IFTAS list are tagged with the comment "iftas" in the database.
You will need to set the environment variable VMCRAWL_BACKPORTS to a comma-separated list of the branches you wish to maintain backport information for.
Example:
```
VMCRAWL_BACKPORTS="4.5,4.4,4.3,4.2"
```
For production installations using systemd:
```
# Follow crawler logs in real-time
journalctl -u vmcrawl.service -f

# View recent crawler logs
journalctl -u vmcrawl.service -n 100

# View crawler logs since boot
journalctl -u vmcrawl.service -b

# Follow vmfetch logs in real-time
journalctl -u vmfetch.service -f

# View recent vmfetch logs
journalctl -u vmfetch.service -n 100
```
Crawler Service:
```
# Stop service
systemctl stop vmcrawl.service

# Restart service
systemctl restart vmcrawl.service

# Disable service
systemctl disable vmcrawl.service
```
Fetch Timer:
```
# Stop timer
systemctl stop vmfetch.timer

# Restart timer
systemctl restart vmfetch.timer

# Disable timer
systemctl disable vmfetch.timer

# Manually trigger a fetch
systemctl start vmfetch.service

# Check when the next fetch will run
systemctl list-timers vmfetch.timer
```
- Check logs: `journalctl -u vmcrawl.service -n 50`
- Verify permissions: `ls -la /opt/vmcrawl`
- Test script manually: `sudo -u vmcrawl /opt/vmcrawl/vmcrawl.sh`
```
# Fix ownership
chown -R vmcrawl:vmcrawl /opt/vmcrawl

# Fix script permissions
chmod +x /opt/vmcrawl/vmcrawl.sh
```
We welcome contributions! Please read our contributing guidelines for more details.
This project is licensed under the MIT License. See the LICENSE file for more information.
For any questions or feedback, please open an issue on GitHub.