Ritual is a 4chan archiver that focuses on simplicity.
Notable features include,
- Built using Python 3.12+.
- Uses the Asagi schema.
- Runs in a synchronous, step-by-step manner that's easy to read.
- Minimal dependencies.
- Flexible configurations. You can choose whether you download text, thumbnails, and/or full media for each post.
- SQLite and MySQL database support.
- Avoids downloading duplicate media files.
Ritual will create schemas for you. If you later need database tools, check out https://github.com/sky-cake/asagi-tables.
- Create a file called `configs.py` using `rename_to_configs.py`, and configure it.
- Create a venv and install dependencies,
  - `uv venv`
  - `source .venv/bin/activate`
  - `uv pip install -r requirements.txt`
  - If you are using MySQL, uncomment the mysql-connector package.
- Ritual depends on https://github.com/sky-cake/asagi-tables for SQLite and MySQL schema creation.
  - Configure the variables in the `install_asagi_tables.sh` file, and then run it.
- Run `python3.12 main.py` to start the scraper.
If you want the program to persist after leaving your shell, you can run Ritual using screen, like so.
- `screen -S ritual` (you might need to `sudo apt install screen`)
- `python3.12 main.py` to run the scraper
- `ctrl-A`, `d` to leave the screen
- `screen -r ritual` to reattach to the screen
`<board>_images.total` is not accurate. This arises from supporting partial media downloading.
sqlite3 /path/to/db "VACUUM INTO '/path/to/backup'"
sqlite3 /path/to/backup 'PRAGMA integrity_check' # optional
gzip /path/to/backup # optional
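The SQLite backup commands above can also be scripted. The following is a minimal sketch using only the Python standard library; `backup_sqlite` is a hypothetical helper name and is not part of Ritual. It assumes an SQLite version that supports `VACUUM INTO` (3.27 or newer).

```python
import gzip
import shutil
import sqlite3
from pathlib import Path


def backup_sqlite(db_path: str, backup_path: str, compress: bool = True) -> Path:
    """Hypothetical helper: VACUUM INTO a backup file, verify it, optionally gzip it."""
    backup = Path(backup_path)
    if backup.exists():
        # VACUUM INTO will not overwrite an existing non-empty file.
        raise FileExistsError(f'{backup} already exists')

    # isolation_level=None keeps the connection in autocommit mode,
    # since VACUUM cannot run inside a transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute('VACUUM INTO ?', (str(backup),))
    finally:
        conn.close()

    # Equivalent of the optional `PRAGMA integrity_check` step.
    check = sqlite3.connect(str(backup))
    try:
        result = check.execute('PRAGMA integrity_check').fetchone()[0]
    finally:
        check.close()
    if result != 'ok':
        raise RuntimeError(f'integrity_check failed: {result}')

    if not compress:
        return backup

    # Equivalent of the optional `gzip` step: compress, then remove the original.
    gz_path = backup.with_name(backup.name + '.gz')
    with open(backup, 'rb') as src, gzip.open(gz_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    backup.unlink()
    return gz_path
```

For example, `backup_sqlite('/path/to/db', '/path/to/backup')` would leave a compressed copy at `/path/to/backup.gz`.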
Here is how the flexible archive configurations work.
- `op_comment_min_chars` and `op_comment_min_chars_unique` filter everything first.
- If a post is blacklisted and whitelisted, it will not be archived. Blacklisted filters take precedence over whitelisted filters.
- If only a blacklist is specified, skip blacklisted posts, and archive everything else.
- If only a whitelist is specified, archive whitelisted posts, and skip everything else.
- If no white/black lists are specified, archive everything.
- If a thread is marked as "should archive" from the above rules, media downloads can be further filtered based on the `dl_th_*` and `dl_fm_*` configs.
  - `dl`: download
  - `th`: thumb
  - `fm`: full_media
- To download all/no media, specify True/False. To filter media, assign a regex pattern. Media can be filtered based on three levels.
  - `op`: OP media
  - `thread`: media in the whole thread
  - `post`: media per post
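The archive rules above can be sketched as a single decision function. This is an illustration only, not Ritual's actual code: `should_archive` is a hypothetical name, the use of `re.search` against the OP's subject and comment is an assumption, and the length thresholds (skip when the comment has fewer characters, or fewer distinct characters, than the configured minimum) are inferred from the comments in `rename_to_configs.py`.

```python
import re


def should_archive(subject: str, comment: str, cfg: dict) -> bool:
    """Hypothetical helper mirroring the filter order described above."""
    # 1. Length filters run first (threshold semantics are an assumption).
    min_chars = cfg.get('op_comment_min_chars')
    if min_chars is not None and len(comment) < min_chars:
        return False
    min_unique = cfg.get('op_comment_min_chars_unique')
    if min_unique is not None and len(set(comment)) < min_unique:
        return False

    text = f'{subject}\n{comment}'
    blacklist = cfg.get('blacklist')
    whitelist = cfg.get('whitelist')

    # 2. The blacklist takes precedence over the whitelist.
    if blacklist and re.search(blacklist, text):
        return False

    # 3. With a whitelist, only matching threads are archived.
    if whitelist:
        return re.search(whitelist, text) is not None

    # 4. No whitelist: archive everything that was not blacklisted.
    return True
```

With the `/g/` settings from the example config, an OP mentioning both "linux" and "local models" would be skipped, because the blacklist match is checked before the whitelist.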
Here is an example from rename_to_configs.py,
boards = {
    'g': {
        'blacklist': '.*(local models).*',  # If an OP contains "local models" in the subject or comment, then skip the thread.
        'whitelist': '.*(home server|linux).*',  # Otherwise, for OPs with "home server" or "linux" in the subject or comment, apply the other configs.
        'thread_text': True,  # Archive the text? Blacklist and whitelist filters apply. Disable by setting {'thread_text': False}.
        'dl_fm_thread': '.*(wireguard).*',  # If a thread/OP mentions "wireguard", download all the full media for the thread.
        'dl_fm_post': '.*(wireguard).*',  # If a reply mentions "wireguard", download that post's full media.
        'dl_fm_op': '.*(wireguard).*',  # If an OP mentions "wireguard", download the OP's full media.
        # Thumbnail downloads work the same way, but they are specified with dl_th_* instead of dl_fm_*
    },
    'gif': {
        # This will only gather thread text from /gif/. No files.
        # By default, we assume {'thread_text': True}
    },
    'ck': {
        # Archive "Coffee Time General" threads with thumbnails only.
        'whitelist': '.*Coffee Time General.*',
        'thread_text': True,
        'dl_th_post': True,
        'dl_fm_post': False,
    },
    't': {
        # Archive threads. Only download OP full media and OP thumbnails.
        'thread_text': True,
        'dl_fm_op': True,
        'dl_th_op': True,
    },
    'biz': {
        # Archive threads. No files.
        'thread_text': True,
        'op_comment_min_chars': 4,  # Skips "omg" "." "lol"
        'op_comment_min_chars_unique': 3,  # Skips "lol" "hahaha" "aaaaa"
    }
}
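To make the `dl_fm_*` and `dl_th_*` settings in the example above more concrete, here is a rough sketch of how a single post's media decision might be resolved. The function names are hypothetical and the matching details are assumptions: an unset key is treated as "do not download" (consistent with the `/gif/` entry), thread-level and OP-level patterns are matched against the OP's text, and post-level patterns are matched against the post's own text. Ritual's real implementation may differ.

```python
import re


def wants_media(setting, text: str) -> bool:
    """Hypothetical: resolve one dl_* value, which may be True, False, unset, or a regex."""
    if isinstance(setting, bool):
        return setting
    if isinstance(setting, str):
        return re.search(setting, text) is not None
    return False  # Assumption: an unset key means no download.


def should_download(cfg: dict, kind: str, op_text: str, post_text: str, is_op: bool) -> bool:
    """Hypothetical: kind is 'fm' (full media) or 'th' (thumbnail)."""
    # Thread level: the OP's text decides for every post in the thread.
    if wants_media(cfg.get(f'dl_{kind}_thread'), op_text):
        return True
    # OP level: only the OP's own media.
    if is_op and wants_media(cfg.get(f'dl_{kind}_op'), op_text):
        return True
    # Post level: each post is judged on its own text.
    return wants_media(cfg.get(f'dl_{kind}_post'), post_text)
```

Under these assumptions, the `/t/` entry downloads full media and thumbnails for OPs only, while the `/ck/` entry downloads thumbnails for every post and no full media.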