A command-line tool to scrape a Confluence space, save pages as HTML, Markdown, and PDF, and then merge the PDFs into a single document with a table of contents.

blurayne/confluence-scraper

Confluence Scraper

About

A command-line tool that scrapes a Confluence space using Playwright with headless Chromium, saves pages as HTML, Markdown, and PDF, and then merges the PDFs into a single document with a table of contents.

Formats currently dumped:

  • PDF
  • Markdown
  • HTML
  • Raw (Confluence Storage Format)

TODO

  • Support downloading/embedding attachments

Setup

First install Go, then the Playwright driver and its browser backends:

# Install Browser backends
go run github.com/playwright-community/playwright-go/cmd/playwright install chromium-headless-shell firefox

# Ubuntu: you may first need to install libavif13
sudo apt install libavif13

# If the above doesn't work
npx playwright install-deps

Then build and run:

# build
go build 

# run
./confluence-dumper

Usage

$ ./confluence-scraper --help
... provided via command-line flags, environment variables,
or a config file (default: $HOME/.confluence-scraper.yaml).

Environment variables are prefixed with 'CONFLUENCE_'. For example, 
'--confluence-base-url' can be set via 'CONFLUENCE_BASE_URL', and 
'--token' via 'CONFLUENCE_TOKEN'. Other flags follow the pattern: 
'--output-dir' -> 'CONFLUENCE_OUTPUT_DIR', etc.

Usage:
  confluence-scraper [command]

Available Commands:
  check       Checks the integrity of the downloaded Confluence data.
  completion  Generate the autocompletion script for the specified shell
  grab        Grabs pages from Confluence and saves them locally.
  help        Help about any command
  toc         Generates a Table of Contents and merges PDFs.

Flags:
      --api-rps int                  Max number of API requests per second (default 4)
      --api-threads int              Max concurrent API threads (default 4)
      --config string                config file (default is $HOME/.confluence-scraper.yaml)
      --confluence-base-url string   Confluence base URL (without trailing slashes!)
      --headless                     Run browser in headless mode (default true)
  -h, --help                         help for confluence-scraper
      --keep-open                    Keep browser open after grabbing a page
      --max-pages int                Maximum concurrent browser pages for PDF generation (default 3)
      --no-headless                  Force the chromeless browser to run in windowed (non-headless) mode.
      --output-dir string            Output directory (default "output")
      --pdf-tool string              Tool for PDF merging (qpdf or pdftk) (default "pdftk")
      --token string                 Confluence API token

Use "confluence-scraper [command] --help" for more information about a command.

You can also set environment variables so that you don't have to pass the token and base URL on every call:

export CONFLUENCE_TOKEN=MDAzO5YyNjE…VVX233GYD+y7aGDaidi
export CONFLUENCE_BASE_URL=https://internal.mylab.com/confluence
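
The naming rule from the help text (flag name upper-cased, dashes turned into underscores, `CONFLUENCE_` prefix) can be sketched in Go as below. This is only an illustration of the documented mapping, not the tool's actual code; the function name `envName` is hypothetical, and the rule that a flag already starting with `confluence-` is not prefixed a second time is an assumption inferred from the `--confluence-base-url` example.

```go
package main

import (
	"fmt"
	"strings"
)

// envPrefix is the prefix the help text documents for environment variables.
const envPrefix = "CONFLUENCE_"

// envName (hypothetical helper) converts a long flag such as "--output-dir"
// into the environment variable name described in the help text.
// Assumption: a flag that already starts with "confluence-" is not prefixed
// twice, matching "--confluence-base-url" -> "CONFLUENCE_BASE_URL".
func envName(flag string) string {
	name := strings.TrimLeft(flag, "-")
	name = strings.ToUpper(strings.ReplaceAll(name, "-", "_"))
	if strings.HasPrefix(name, envPrefix) {
		return name
	}
	return envPrefix + name
}

func main() {
	for _, f := range []string{"--confluence-base-url", "--token", "--output-dir"} {
		fmt.Printf("%s -> %s\n", f, envName(f))
	}
}
```

Running it prints the three mappings named in the help text, which is a quick way to look up the variable for any other flag.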

Then start grabbing the content:

./confluence-dumper grab --page-key SPACEKEY
./confluence-dumper grab --page-id ID

Before continuing, an integrity check is advised:

./confluence-dumper check
2025/12/01 12:17:52 Running pre-TOC data integrity check...
2025/12/01 12:17:56 --- Starting Data Integrity Check ---
2025/12/01 12:17:56 ⚠️ Data integrity check found missing files.
2025/12/01 12:17:56 FATAL: Found 44 page IDs referenced as children but missing JSON metadata (cache). Please re-run the scraping pass (-target) to fetch them. Missing IDs: [174392216 297269769 302877016 315643535 315643647 318924604 343223213 350012244 350921365 371116443 373913113 374156685 380580750 395825411 395825413 395825418 395825450 395978352 395989302 403657165 404359645 431303635 474096569 621155835 621155837 621160489 791253140 791253146 803451602 803452375 803454118 805938720 94715701 94715778 94715987 94716031 94724743 94724828 94724833 94724846 94724863 94724897 94724903 94724940]
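
The failure in the log above (page IDs referenced as children but without cached JSON metadata) boils down to a set difference. A minimal Go sketch of that idea follows; it is not the tool's implementation. The `missingChildren` function and the in-memory map from page ID to child IDs are hypothetical stand-ins for however the cache is actually stored on disk.

```go
package main

import (
	"fmt"
	"sort"
)

// missingChildren (hypothetical helper) returns every page ID that appears as
// a child of some cached page but has no cache entry of its own.
// children maps a cached page ID to the IDs of its child pages; this map is an
// assumed in-memory view of the JSON metadata, not the tool's real data model.
func missingChildren(children map[string][]string) []string {
	seen := make(map[string]bool)
	var missing []string
	for _, kids := range children {
		for _, id := range kids {
			if _, cached := children[id]; !cached && !seen[id] {
				seen[id] = true
				missing = append(missing, id)
			}
		}
	}
	sort.Strings(missing) // sort so the report is deterministic
	return missing
}

func main() {
	cache := map[string][]string{
		"100": {"200", "300"},
		"200": {"400"}, // 400 is referenced but was never fetched
		"300": nil,
	}
	fmt.Println("missing page IDs:", missingChildren(cache))
}
```

Any ID this kind of check reports still has to be fetched by another grab before the toc pass can succeed.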

If the integrity check fails, just re-run the grab (or fix the code first):

./confluence-dumper grab --page-key SPACEKEY

Once you have the JSON metadata, generate the merged PDF with the toc pass:

./confluence-dumper toc

To generate the merged PDF, the tool currently relies on an external CLI tool, either qpdf or pdftk (the default), because at the moment only proprietary Go libraries support both merging and bookmarks.

./confluence-dumper toc --pdf-tool qpdf
./confluence-dumper toc --pdf-tool pdftk
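
For orientation, here is a hedged Go sketch of how a plain concatenation is usually expressed for each of the two tools. It only builds the argument lists (`qpdf --empty --pages ... -- out.pdf` and `pdftk ... cat output out.pdf`); whether the scraper invokes them exactly this way is an assumption, bookmark generation is not covered, and `mergeArgs` and the file names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// mergeArgs (hypothetical helper) builds the argument list for concatenating
// PDFs with qpdf or pdftk. It does not run the tool and adds no bookmarks.
func mergeArgs(tool string, inputs []string, output string) ([]string, error) {
	if len(inputs) == 0 {
		return nil, fmt.Errorf("no input PDFs given")
	}
	switch tool {
	case "qpdf":
		// qpdf --empty --pages a.pdf b.pdf -- merged.pdf
		args := []string{"--empty", "--pages"}
		args = append(args, inputs...)
		return append(args, "--", output), nil
	case "pdftk":
		// pdftk a.pdf b.pdf cat output merged.pdf
		args := append([]string{}, inputs...)
		return append(args, "cat", "output", output), nil
	default:
		return nil, fmt.Errorf("unsupported PDF tool %q (want qpdf or pdftk)", tool)
	}
}

func main() {
	pages := []string{"page-1.pdf", "page-2.pdf"} // hypothetical file names
	for _, tool := range []string{"qpdf", "pdftk"} {
		args, err := mergeArgs(tool, pages, "merged.pdf")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(tool, strings.Join(args, " "))
	}
}
```

The resulting slice could be handed to `exec.Command(tool, args...)`; keeping argument construction separate from execution makes the two tool variants easy to compare and test.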
