Medium Blog Scraper

A Node.js application that scrapes all published blog posts from a Medium profile and downloads them as markdown files with YAML frontmatter metadata.

Features

  • 🔐 Google OAuth Authentication - Handles Medium's SSO requirement
  • 📄 Complete Post Export - Downloads all published posts from any Medium profile
  • 🔄 URL Format Support - Works with both medium.com/@username and username.medium.com formats
  • 📝 Markdown Conversion - Converts HTML content to clean markdown with frontmatter
  • 🖼️ Image Download - Downloads and organizes all images locally
  • 📊 Rich Metadata - Captures titles, dates, tags, authors, and more
  • 🏗️ Organized Output - Creates structured directory layout for posts and images
  • 🤝 Respectful Scraping - Includes request delays and a proper user agent

Quick Start

# Install dependencies
npm install

# Authenticate with Google OAuth
npm start auth

# Get a summary of posts (without downloading)
npm start summary https://medium.com/@username

# Scrape all posts from a profile
npm start scrape https://medium.com/@username

Installation

Prerequisites

  • Node.js 16+
  • npm or yarn
  • Google OAuth credentials (see the Google OAuth Setup section below)

Setup

  1. Clone and install:

    git clone <repository-url>
    cd medium-scraper
    npm install
  2. Configure Google OAuth:

    Create a .env file in the project root:

    GOOGLE_CLIENT_ID=your_client_id_here
    GOOGLE_CLIENT_SECRET=your_client_secret_here
    GOOGLE_REDIRECT_URI=http://localhost:8080/oauth/callback
  3. Test the installation:

    npm test
    npm run quality:check
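
The .env values from step 2 are read when the app starts. A minimal sketch of what that loading might look like, assuming the project uses the dotenv package (the real src/config.js may differ):

import 'dotenv/config'

// Hypothetical loader; names and shape are illustrative, not the project's actual API
const loadConfig = () => {
  const required = ['GOOGLE_CLIENT_ID', 'GOOGLE_CLIENT_SECRET', 'GOOGLE_REDIRECT_URI']
  const missing = required.filter((name) => !process.env[name])
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`)
  }
  return {
    clientId: process.env.GOOGLE_CLIENT_ID,
    clientSecret: process.env.GOOGLE_CLIENT_SECRET,
    redirectUri: process.env.GOOGLE_REDIRECT_URI,
    outputDir: process.env.OUTPUT_DIR || './output',
  }
}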

Usage

CLI Commands

The scraper provides several commands for different operations:

# Check authentication status
npm start status

# Authenticate with Google OAuth (required first)
npm start auth

# Get profile summary (quick overview without downloading)
npm start summary <profile-url>

# Scrape all posts from a profile (full download)
npm start scrape <profile-url>

# Scrape only new/updated posts (incremental mode)
npm start incremental <profile-url>

Supported URL Formats

The scraper automatically handles both Medium URL formats:

# Standard Medium format
npm start scrape https://medium.com/@username

# Subdomain format (automatically converted)
npm start scrape https://username.medium.com

# Works with any Medium profile
npm start scrape https://medium.com/@real-username
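
Under the hood this implies a normalization step roughly like the following (illustrative only; the function name is hypothetical and the project's actual helper may differ):

// Hypothetical normalizer: username.medium.com -> medium.com/@username
const normalizeProfileUrl = (input) => {
  const url = new URL(input)
  const match = url.hostname.match(/^([^.]+)\.medium\.com$/)
  if (match && match[1] !== 'www') {
    return `https://medium.com/@${match[1]}`
  }
  return `https://medium.com${url.pathname}`
}

normalizeProfileUrl('https://username.medium.com') // => 'https://medium.com/@username'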

Example Workflow

  1. First-time setup:

    npm start auth

    This opens your browser to authenticate with Google.

  2. Quick preview:

    npm start summary https://medium.com/@username

    Shows how many posts will be downloaded without actually downloading them.

  3. Full scrape:

    npm start scrape https://medium.com/@username

    Downloads all posts and images to the output/ directory.

  4. Incremental updates:

    npm start incremental https://medium.com/@username

    Downloads only new or updated posts since the last scrape.
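
One way incremental mode can decide what to skip is by comparing each post's lastModified timestamp (see the frontmatter format below) against what output/metadata.json recorded on the previous run. A hypothetical sketch, with field names assumed rather than taken from the actual code:

// Keep only posts that are new or changed since the last run (illustrative)
const selectPostsToDownload = (posts, previousMetadata) =>
  posts.filter((post) => {
    const previous = previousMetadata[post.slug]
    return !previous || post.lastModified > previous.lastModified
  })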

Output Structure

The scraper creates an organized directory structure with each post in its own folder:

output/
├── post-title-slug/
│   ├── post-title-slug.md
│   └── images/
│       ├── post-title-slug-featured.jpg
│       ├── post-title-slug-01.jpg
│       └── post-title-slug-02.png
├── another-post-slug/
│   ├── another-post-slug.md
│   └── images/
│       └── another-post-slug-featured.jpg
└── metadata.json
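
Folder and file names are slugs derived from post titles. A minimal slug helper of the kind this layout implies (illustrative; the project's actual implementation may differ):

// 'How to Build Amazing Web Apps' -> 'how-to-build-amazing-web-apps'
const slugify = (title) =>
  title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '')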

Markdown Format

Each post is saved as markdown with YAML frontmatter:

---
title: 'How to Build Amazing Web Apps'
subtitle: 'A comprehensive guide to modern development'
date: '2024-01-15T10:30:00Z'
lastModified: '2024-01-16T14:22:00Z'
author: 'John Developer'
tags: ['javascript', 'web-development', 'tutorial']
featuredImage: './images/how-to-build-amazing-web-apps-featured.jpg'
published: true
---
# How to Build Amazing Web Apps

Post content in clean markdown format...

![Local Image](./images/how-to-build-amazing-web-apps-01.jpg)
*Caption preserved from original*
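
The README does not name the conversion libraries, but a file like the one above can be assembled in a few lines. A sketch assuming js-yaml for the frontmatter and turndown for the HTML-to-markdown step (both assumptions, not confirmed dependencies):

// Illustrative assembly of frontmatter + converted body
import yaml from 'js-yaml'
import TurndownService from 'turndown'

const turndown = new TurndownService()

const buildPostFile = (meta, html) =>
  `---\n${yaml.dump(meta)}---\n\n${turndown.turndown(html)}\n`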

Google OAuth Setup

To scrape Medium posts, you need Google OAuth credentials, since the scraper signs in to Medium through Google SSO.

1. Create Google Cloud Project

  1. Go to Google Cloud Console
  2. Create a new project or select existing one
  3. Enable the People API (the Google+ API has been shut down)

2. Create OAuth Credentials

  1. Go to APIs & Services > Credentials
  2. Click Create Credentials > OAuth client ID
  3. Choose Web application
  4. Add authorized redirect URI: http://localhost:8080/oauth/callback
  5. Download the credentials JSON

3. Configure Environment

Create .env file with your credentials:

GOOGLE_CLIENT_ID=123456789-abc123.apps.googleusercontent.com
GOOGLE_CLIENT_SECRET=your-client-secret-here
GOOGLE_REDIRECT_URI=http://localhost:8080/oauth/callback
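
The README does not specify which OAuth client library the project uses. A minimal sketch of the flow these credentials enable, assuming google-auth-library (the actual src/auth.js may differ):

import { OAuth2Client } from 'google-auth-library'

const client = new OAuth2Client(
  process.env.GOOGLE_CLIENT_ID,
  process.env.GOOGLE_CLIENT_SECRET,
  process.env.GOOGLE_REDIRECT_URI
)

// 1. Send the user to this URL in the browser
const authUrl = client.generateAuthUrl({
  access_type: 'offline',
  scope: ['openid', 'email', 'profile'],
})

// 2. Google redirects back to GOOGLE_REDIRECT_URI with a ?code= parameter
const exchangeCode = async (code) => {
  const { tokens } = await client.getToken(code)
  client.setCredentials(tokens)
  return tokens
}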

Configuration

Environment Variables

Variable              Description                  Required  Default
GOOGLE_CLIENT_ID      Google OAuth client ID       Yes       -
GOOGLE_CLIENT_SECRET  Google OAuth client secret   Yes       -
GOOGLE_REDIRECT_URI   OAuth redirect URI           Yes       http://localhost:8080/oauth/callback
OUTPUT_DIR            Custom output directory      No        ./output
MAX_SCROLL_ATTEMPTS   Max pagination attempts      No        20
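
Because the optional variables are read from the environment, they can also be set inline for a single run (macOS/Linux shells):

# Write posts to ./backup and allow up to 50 scroll attempts
OUTPUT_DIR=./backup MAX_SCROLL_ATTEMPTS=50 npm start scrape https://medium.com/@username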

Customization Options

You can customize the scraping behavior by modifying the options in src/main.js:

const result = await scraper.scrapeProfile(profileUrl, {
  maxScrollAttempts: 10, // How many times to scroll for more posts
  headless: true, // Run browser in headless mode
  fastMode: false, // Enable fast mode for testing (reduces delays)
  debug: false, // Enable debug mode with screenshots and logging
  incremental: false, // Only download new/updated posts
})

Development

Project Structure

src/
├── auth.js          # Google OAuth handling
├── scraper.js       # Medium content extraction
├── converter.js     # HTML to markdown conversion
├── storage.js       # File system operations
├── config.js        # Configuration management
├── utils.js         # Utility functions and logging
├── error-handling.js # Error handling and retry logic
└── main.js          # CLI interface and orchestration

test/
├── acceptance/      # BDD-style acceptance tests
└── unit/           # Unit tests

features/
└── medium-scraper.feature  # Gherkin specifications

Running Tests

# Run all tests
npm test

# Run acceptance tests only
npm run test:acceptance

# Run tests with coverage
npm run test:coverage

# Watch mode for development
npm run test:watch

Code Quality

The project enforces strict code quality standards:

# Check code style and formatting
npm run quality:check

# Auto-fix style issues
npm run quality

# Run linting only
npm run lint

# Run formatting only
npm run format

Code Style

  • No semicolons (enforced by Prettier)
  • Functional programming patterns only
  • Arrow functions exclusively
  • ES modules (import/export)
  • Comprehensive error handling

Troubleshooting

Common Issues

Authentication Errors:

# Check auth status
npm start status

# Re-authenticate if needed
npm start auth

Empty Results:

  • Verify the Medium profile URL is correct
  • Check that the profile has published posts
  • Ensure you're authenticated

Download Failures:

  • Check internet connection
  • Verify Google OAuth credentials
  • Try a smaller maxScrollAttempts value

Image Download Issues:

  • Some images may be hosted on external CDNs with restrictions
  • Check the console output for specific image download errors
  • Images that fail to download will be noted in the metadata

Debug Mode

For verbose logging, adjust the logger in src/main.js, or inspect output/metadata.json for per-post operation results.
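
Alternatively, the debug option documented under Customization Options turns on screenshots and extra logging for a single run:

const result = await scraper.scrapeProfile(profileUrl, { debug: true })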

Getting Help

  1. Check the troubleshooting section
  2. Review the test files in test/acceptance/ for usage examples
  3. Look at the Gherkin scenarios in features/medium-scraper.feature
  4. Open an issue if you find a bug

API Usage

You can also use the scraper programmatically:

import { createMediumScraper } from './src/main.js'

const scraper = createMediumScraper()

// Authenticate
await scraper.auth.authenticate()

// Get summary
const summary = await scraper.getProfileSummary('https://medium.com/@username')

// Full scrape
const result = await scraper.scrapeProfile('https://medium.com/@username', {
  maxScrollAttempts: 5,
})

console.log(`Scraped ${result.postsProcessed} posts successfully`)

// Incremental scrape
const incrementalResult = await scraper.scrapeProfile(
  'https://medium.com/@username',
  {
    incremental: true,
  }
)

console.log(`Updated ${incrementalResult.postsProcessed} posts`)

Limitations

  • Requires Google OAuth authentication
  • Requests are rate-limited to be respectful of Medium's servers
  • Some private posts may not be accessible
  • Custom Medium domains may have different layouts
  • Images hosted on external CDNs may have download restrictions

License

This project is provided as-is for educational and personal backup purposes. Please respect Medium's Terms of Service and only scrape content you have permission to access.

Contributing

  1. Follow the existing code style (functional programming, no semicolons)
  2. Write acceptance tests for new features
  3. Ensure all tests pass: npm test && npm run quality:check
  4. Follow the ATDD (Acceptance Test Driven Development) workflow outlined in CLAUDE.md
