A command-line tool to scrape a Confluence space using Playwright with chromeless chromium, save pages as HTML, Markdown, and PDF, and then merge the PDFs into a single document with a table of contents.
Currently dumped formats:
- Markdown
- HTML
- Raw (Confluence Storage Format)
- Support downloading/embedding attachments
First install Go, then the Playwright driver:
# Install Browser backends
go run github.com/playwright-community/playwright-go/cmd/playwright install chromium-headless-shell firefox
# Ubuntu: you may first need to install libavif13
sudo apt install libavif13
# if the above doesn't work
npx playwright install-deps
Then build and run:
# build
go build
# run
./confluence-dumper
$ ./confluence-scraper --help
…provided via command-line flags, environment variables,
or a config file (default: $HOME/.confluence-scraper.yaml).
Environment variables are prefixed with 'CONFLUENCE_'. For example,
'--confluence-base-url' can be set via 'CONFLUENCE_BASE_URL', and
'--token' via 'CONFLUENCE_TOKEN'. Other flags follow the pattern:
'--output-dir' -> 'CONFLUENCE_OUTPUT_DIR', etc.
Usage:
  confluence-scraper [command]

Available Commands:
  check       Checks the integrity of the downloaded Confluence data.
  completion  Generate the autocompletion script for the specified shell
  grab        Grabs pages from Confluence and saves them locally.
  help        Help about any command
  toc         Generates a Table of Contents and merges PDFs.

Flags:
      --api-rps int                  Max number of API requests per second (default 4)
      --api-threads int              Max concurrent API threads (default 4)
      --config string                config file (default is $HOME/.confluence-scraper.yaml)
      --confluence-base-url string   Confluence base URL (without trailing slashes!)
      --headless                     Run browser in headless mode (default true)
  -h, --help                         help for confluence-scraper
      --keep-open                    Keep browser open after grabbing a page
      --max-pages int                Maximum concurrent browser pages for PDF generation (default 3)
      --no-headless                  Force the chromeless browser to run in windowed (non-headless) mode.
      --output-dir string            Output directory (default "output")
      --pdf-tool string              Tool for PDF merging (qpdf or pdftk) (default "pdftk")
      --token string                 Confluence API token

Use "confluence-scraper [command] --help" for more information about a command.
You can also set up environment variables for the "easy job".
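The flag-to-variable naming described in the help text can be sketched in a few lines of Go. This is an illustration only: `envVarName` is a hypothetical helper, not the tool's own code, and applying the rule to flags the help text does not spell out (for example `--api-rps`) is an assumption based on the "follow the pattern" remark.

```go
package main

import (
	"fmt"
	"strings"
)

// envPrefix is the prefix the help text says every variable carries.
const envPrefix = "CONFLUENCE_"

// envVarName maps a long flag such as "--output-dir" to the environment
// variable name described in the help text. Hypothetical helper.
func envVarName(flag string) string {
	name := strings.TrimLeft(flag, "-")
	name = strings.ToUpper(strings.ReplaceAll(name, "-", "_"))
	// "--confluence-base-url" already starts with the prefix; the help text
	// maps it to CONFLUENCE_BASE_URL rather than doubling the prefix.
	if strings.HasPrefix(name, envPrefix) {
		return name
	}
	return envPrefix + name
}

func main() {
	for _, f := range []string{"--confluence-base-url", "--token", "--output-dir"} {
		fmt.Printf("%s -> %s\n", f, envVarName(f))
	}
}
```

The two variables used in this README are the token and the base URL: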
export CONFLUENCE_TOKEN=MDAzO5YyNjE…VVX233GYD+y7aGDaidi
export CONFLUENCE_BASE_URL=https://internal.mylab.com/confluence
Then start grabbing the content:
./confluence-dumper grab -page-key SPACEKEY
./confluence-dumper grab -page-id ID
Before continuing, an integrity check is advised:
./confluence-dumper check
2025/12/01 12:17:52 Running pre-TOC data integrity check...
2025/12/01 12:17:56 --- Starting Data Integrity Check ---
2025/12/01 12:17:56 ⚠️ Data integrity check found missing files.
2025/12/01 12:17:56 FATAL: Found 44 page IDs referenced as children but missing JSON metadata (cache). Please re-run the scraping pass (-target) to fetch them. Missing IDs: [174392216 297269769 302877016 315643535 315643647 318924604 343223213 350012244 350921365 371116443 373913113 374156685 380580750 395825411 395825413 395825418 395825450 395978352 395989302 403657165 404359645 431303635 474096569 621155835 621155837 621160489 791253140 791253146 803451602 803452375 803454118 805938720 94715701 94715778 94715987 94716031 94724743 94724828 94724833 94724846 94724863 94724897 94724903 94724940]
If the integrity check fails, just re-run or fix the code first.
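The failure reported above amounts to a set difference: IDs that appear as somebody's child but have no cached metadata. The sketch below shows that idea in Go under stated assumptions; `missingChildren` and both map shapes are hypothetical and do not reflect the tool's actual data model. The string sort is a guess prompted by the log, where `94715701` is listed after `805938720`.

```go
package main

import (
	"fmt"
	"sort"
)

// missingChildren returns every child ID that some page references but for
// which no metadata was downloaded. children maps a page ID to its child IDs;
// haveMeta is the set of page IDs with cached JSON metadata.
// Hypothetical helper, not the scraper's own implementation.
func missingChildren(children map[string][]string, haveMeta map[string]bool) []string {
	seen := map[string]bool{}
	var missing []string
	for _, kids := range children {
		for _, id := range kids {
			if !haveMeta[id] && !seen[id] {
				seen[id] = true
				missing = append(missing, id)
			}
		}
	}
	// Sort as strings so the result does not depend on map iteration order.
	sort.Strings(missing)
	return missing
}

func main() {
	children := map[string][]string{
		"100": {"101", "102"},
		"101": {"103"},
	}
	haveMeta := map[string]bool{"100": true, "101": true}
	fmt.Println(missingChildren(children, haveMeta)) // [102 103]
}
```

Re-running the grab pass, as in the command that follows, is the documented way to fetch the missing pages.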
./confluence-dumper grab -page-key SPACEKEY
Once you have the JSON, generate the PDF with the toc pass:
./confluence-dumper toc
To generate the PDF we currently use CLI tools such as qpdf or pdftk (the default). At the moment, only proprietary Go libraries support both merging and bookmarks.
./confluence-dumper toc --pdf-tool qpdf
./confluence-dumper toc --pdf-tool pdftk
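For orientation, the two tools take their inputs in different shapes. The Go sketch below builds the argument list for a plain merge with either one. `mergeArgs` and the file names are hypothetical, the scraper's real invocation may differ, and bookmarks are not covered here.

```go
package main

import (
	"fmt"
	"os/exec"
)

// mergeArgs builds the argument list for merging inputs into out with the
// chosen tool. Hypothetical helper; it covers merging only, not bookmarks.
func mergeArgs(tool string, inputs []string, out string) ([]string, error) {
	switch tool {
	case "qpdf":
		// qpdf --empty --pages a.pdf b.pdf -- merged.pdf
		args := []string{"--empty", "--pages"}
		args = append(args, inputs...)
		return append(args, "--", out), nil
	case "pdftk":
		// pdftk a.pdf b.pdf cat output merged.pdf
		args := append([]string{}, inputs...)
		return append(args, "cat", "output", out), nil
	default:
		return nil, fmt.Errorf("unsupported pdf tool %q", tool)
	}
}

func main() {
	args, err := mergeArgs("qpdf", []string{"a.pdf", "b.pdf"}, "merged.pdf")
	if err != nil {
		panic(err)
	}
	// Build the command without running it, so the sketch works even when
	// qpdf is not installed.
	cmd := exec.Command("qpdf", args...)
	fmt.Println(cmd.Args)
}
```

Calling `cmd.Run()` on the resulting command would perform the merge, provided the chosen tool is on `PATH`.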