A Node.js application that scrapes all published blog posts from a Medium profile and downloads them as markdown files with YAML frontmatter metadata.
- 🔐 Google OAuth Authentication - Handles Medium's SSO requirement
- 📄 Complete Post Export - Downloads all published posts from any Medium profile
- 🔄 URL Format Support - Works with both `medium.com/@username` and `username.medium.com` formats
- 📝 Markdown Conversion - Converts HTML content to clean markdown with frontmatter
- 🖼️ Image Download - Downloads and organizes all images locally
- 📊 Rich Metadata - Captures titles, dates, tags, authors, and more
- 🏗️ Organized Output - Creates structured directory layout for posts and images
- ⚡ Respectful Scraping - Includes delays and proper user agents
# Install dependencies
npm install
# Authenticate with Google OAuth
npm start auth
# Get a summary of posts (without downloading)
npm start summary https://medium.com/@username
# Scrape all posts from a profile
npm start scrape https://medium.com/@username

Prerequisites:

- Node.js 16+
- npm or yarn
- Google OAuth credentials (see Setup Guide)
- Clone and install:

  git clone <repository-url>
  cd medium-scraper
  npm install

- Configure Google OAuth:

  Create a `.env` file in the project root:

  GOOGLE_CLIENT_ID=your_client_id_here
  GOOGLE_CLIENT_SECRET=your_client_secret_here
  GOOGLE_REDIRECT_URI=http://localhost:8080/oauth/callback

- Test the installation:

  npm test
  npm run quality:check
The scraper provides several commands for different operations:
# Check authentication status
npm start status
# Authenticate with Google OAuth (required first)
npm start auth
# Get profile summary (quick overview without downloading)
npm start summary <profile-url>
# Scrape all posts from a profile (full download)
npm start scrape <profile-url>
# Scrape only new/updated posts (incremental mode)
npm start incremental <profile-url>

The scraper automatically handles both Medium URL formats:
# Standard Medium format
npm start scrape https://medium.com/@username
# Custom domain format (automatically converted)
npm start scrape https://username.medium.com
# Works with any Medium profile
npm start scrape https://medium.com/@real-username
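Under the hood, both forms can be reduced to a single canonical profile URL before scraping. A minimal sketch of that normalization (the helper name is hypothetical and the project's actual logic may differ):

```javascript
// Hypothetical normalizer: accepts either URL form and returns medium.com/@username
export const normalizeProfileUrl = (input) => {
  const url = new URL(input)
  // username.medium.com -> https://medium.com/@username
  if (url.hostname !== 'medium.com' && url.hostname.endsWith('.medium.com')) {
    const username = url.hostname.replace('.medium.com', '')
    return `https://medium.com/@${username}`
  }
  // Already in the medium.com/@username form; strip any trailing slash
  return `https://medium.com${url.pathname.replace(/\/$/, '')}`
}
```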
A typical workflow:

- First-time setup:
npm start auth
This opens your browser to authenticate with Google.
- Quick preview:
npm start summary https://medium.com/@username
Shows how many posts will be downloaded without actually downloading them.
- Full scrape:
npm start scrape https://medium.com/@username
Downloads all posts and images to the `output/` directory.

- Incremental updates:
npm start incremental https://medium.com/@username
Downloads only new or updated posts since the last scrape.
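Incremental mode presumably compares each discovered post against the previous run's `output/metadata.json`. A rough sketch of that check, with assumed field names (`slug`, `lastModified`):

```javascript
// Sketch of the incremental check; field names are assumptions, not the project's API
import { readFile } from 'node:fs/promises'

export const loadPreviousRun = async (outputDir = './output') => {
  try {
    return JSON.parse(await readFile(`${outputDir}/metadata.json`, 'utf8'))
  } catch {
    return { posts: {} } // first run: nothing has been downloaded yet
  }
}

export const needsDownload = (post, previousRun) => {
  const seen = previousRun.posts[post.slug]
  // Download if the post is new, or if its lastModified timestamp changed
  return !seen || seen.lastModified !== post.lastModified
}
```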
The scraper creates an organized directory structure with each post in its own folder:
output/
├── post-title-slug/
│ ├── post-title-slug.md
│ └── images/
│ ├── post-title-slug-featured.jpg
│ ├── post-title-slug-01.jpg
│ └── post-title-slug-02.png
├── another-post-slug/
│ ├── another-post-slug.md
│ └── images/
│ └── another-post-slug-featured.jpg
└── metadata.json
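Folder and file names are slugs derived from post titles. A plausible slug helper (illustrative only; the project's actual rules may differ):

```javascript
// Turn a post title into a filesystem-friendly slug
export const slugify = (title) =>
  title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse runs of non-alphanumerics into hyphens
    .replace(/^-+|-+$/g, '') // trim leading/trailing hyphens

// slugify('How to Build Amazing Web Apps') -> 'how-to-build-amazing-web-apps'
```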
Each post is saved as markdown with YAML frontmatter:
---
title: 'How to Build Amazing Web Apps'
subtitle: 'A comprehensive guide to modern development'
date: '2024-01-15T10:30:00Z'
lastModified: '2024-01-16T14:22:00Z'
author: 'John Developer'
tags: ['javascript', 'web-development', 'tutorial']
featuredImage: './images/how-to-build-amazing-web-apps-featured.jpg'
published: true
---
# How to Build Amazing Web Apps
Post content in clean markdown format...

*Caption preserved from original*
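To illustrate the conversion step, here is a rough sketch of producing such a file from scraped post data. It assumes the turndown and js-yaml packages purely for illustration; the actual `src/converter.js` may be implemented differently:

```javascript
// Sketch only: HTML body -> markdown, prefixed with a YAML frontmatter block
import TurndownService from 'turndown'
import { dump } from 'js-yaml'

const turndown = new TurndownService({ headingStyle: 'atx' })

export const toMarkdownFile = (post) => {
  // post is assumed to carry the fields shown in the frontmatter example above
  const frontmatter = dump({
    title: post.title,
    subtitle: post.subtitle,
    date: post.date,
    author: post.author,
    tags: post.tags,
    published: true,
  })
  return `---\n${frontmatter}---\n\n${turndown.turndown(post.html)}\n`
}
```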
To scrape Medium posts, you need Google OAuth credentials since Medium uses Google SSO.

- Go to Google Cloud Console
- Create a new project or select existing one
- Enable the Google+ API or People API
- Go to APIs & Services > Credentials
- Click Create Credentials > OAuth client ID
- Choose Web application
- Add authorized redirect URI: `http://localhost:8080/oauth/callback`
- Download the credentials JSON
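For reference, the browser flow behind `npm start auth` looks roughly like the sketch below. It assumes the google-auth-library package and a throwaway local server on port 8080; the actual `src/auth.js` may differ:

```javascript
// Sketch of the OAuth code exchange; not the project's actual implementation
import http from 'node:http'
import { OAuth2Client } from 'google-auth-library'

export const authenticate = async ({ clientId, clientSecret, redirectUri }) => {
  const client = new OAuth2Client(clientId, clientSecret, redirectUri)
  const authUrl = client.generateAuthUrl({ access_type: 'offline', scope: ['openid', 'email'] })
  console.log(`Open this URL in your browser:\n${authUrl}`)

  // Wait for Google to redirect back to http://localhost:8080/oauth/callback?code=...
  const code = await new Promise((resolve) => {
    const server = http.createServer((req, res) => {
      const { searchParams } = new URL(req.url, 'http://localhost:8080')
      res.end('Authentication complete. You can close this tab.')
      server.close()
      resolve(searchParams.get('code'))
    })
    server.listen(8080)
  })

  const { tokens } = await client.getToken(code)
  return tokens
}
```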
Create a `.env` file with your credentials:
GOOGLE_CLIENT_ID=123456789-abc123.apps.googleusercontent.com
GOOGLE_CLIENT_SECRET=your-client-secret-here
GOOGLE_REDIRECT_URI=http://localhost:8080/oauth/callback

| Variable | Description | Required | Default |
|---|---|---|---|
| `GOOGLE_CLIENT_ID` | Google OAuth client ID | Yes | - |
| `GOOGLE_CLIENT_SECRET` | Google OAuth client secret | Yes | - |
| `GOOGLE_REDIRECT_URI` | OAuth redirect URI | Yes | `http://localhost:8080/oauth/callback` |
| `OUTPUT_DIR` | Custom output directory | No | `./output` |
| `MAX_SCROLL_ATTEMPTS` | Max pagination attempts | No | 20 |
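As an illustration, these variables might be loaded and validated along the following lines (a sketch that assumes the dotenv package; the actual `src/config.js` may differ):

```javascript
// Sketch of loading and validating the environment configuration
import 'dotenv/config' // assumes dotenv populates process.env from .env

const required = ['GOOGLE_CLIENT_ID', 'GOOGLE_CLIENT_SECRET', 'GOOGLE_REDIRECT_URI']

export const loadConfig = () => {
  const missing = required.filter((name) => !process.env[name])
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`)
  }
  return {
    clientId: process.env.GOOGLE_CLIENT_ID,
    clientSecret: process.env.GOOGLE_CLIENT_SECRET,
    redirectUri: process.env.GOOGLE_REDIRECT_URI,
    outputDir: process.env.OUTPUT_DIR ?? './output',
    maxScrollAttempts: Number(process.env.MAX_SCROLL_ATTEMPTS ?? 20),
  }
}
```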
You can customize the scraping behavior by modifying the options in src/main.js:
const result = await scraper.scrapeProfile(profileUrl, {
maxScrollAttempts: 10, // How many times to scroll for more posts
headless: true, // Run browser in headless mode
fastMode: false, // Enable fast mode for testing (reduces delays)
debug: false, // Enable debug mode with screenshots and logging
incremental: false, // Only download new/updated posts
})
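For context, `maxScrollAttempts` bounds a scroll-and-collect loop roughly like the sketch below, which assumes Puppeteer; the actual `src/scraper.js` may use a different driver or selectors:

```javascript
// Sketch of the scroll-based post discovery loop (Puppeteer assumed for illustration)
import puppeteer from 'puppeteer'

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

export const collectPostLinks = async (profileUrl, { maxScrollAttempts = 20, headless = true } = {}) => {
  const browser = await puppeteer.launch({ headless })
  const page = await browser.newPage()
  await page.goto(profileUrl, { waitUntil: 'networkidle2' })

  for (let i = 0; i < maxScrollAttempts; i++) {
    // Scroll to the bottom so Medium lazy-loads the next batch of posts
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight))
    await sleep(1500) // respectful delay between scrolls
  }

  // Collect every link on the page that looks like a post URL
  const links = await page.evaluate(() =>
    Array.from(document.querySelectorAll('a[href*="/@"]'), (a) => a.href)
  )
  await browser.close()
  return [...new Set(links)]
}
```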
The project is organized as follows:

src/
├── auth.js # Google OAuth handling
├── scraper.js # Medium content extraction
├── converter.js # HTML to markdown conversion
├── storage.js # File system operations
├── config.js # Configuration management
├── utils.js # Utility functions and logging
├── error-handling.js # Error handling and retry logic
└── main.js # CLI interface and orchestration
test/
├── acceptance/ # BDD-style acceptance tests
└── unit/ # Unit tests
features/
└── medium-scraper.feature # Gherkin specifications
# Run all tests
npm test
# Run acceptance tests only
npm run test:acceptance
# Run tests with coverage
npm run test:coverage
# Watch mode for development
npm run test:watch

The project enforces strict code quality standards:
# Check code style and formatting
npm run quality:check
# Auto-fix style issues
npm run quality
# Run linting only
npm run lint
# Run formatting only
npm run format

- No semicolons (enforced by Prettier)
- Functional programming patterns only
- Arrow functions exclusively
- ES modules (import/export)
- Comprehensive error handling
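For orientation, a small snippet written in this style:

```javascript
// ES module export, arrow functions, no semicolons
export const normalizeTags = (tags) =>
  tags
    .map((tag) => tag.trim().toLowerCase())
    .filter((tag) => tag.length > 0)
```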
Authentication Errors:
# Check auth status
npm start status
# Re-authenticate if needed
npm start auth

Empty Results:
- Verify the Medium profile URL is correct
- Check that the profile has published posts
- Ensure you're authenticated
Download Failures:
- Check internet connection
- Verify Google OAuth credentials
- Try running with a smaller `maxScrollAttempts` value
Image Download Issues:
- Some images may be hosted on external CDNs with restrictions
- Check the console output for specific image download errors
- Images that fail to download will be noted in the metadata
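Transient failures are usually worth retrying after a short delay. A generic helper in that spirit (illustrative; `src/error-handling.js` may differ, and `downloadImage` below is a hypothetical function):

```javascript
// Retry an async operation a few times, backing off a little more on each attempt
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

export const withRetry = async (fn, { attempts = 3, baseDelayMs = 1000 } = {}) => {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn()
    } catch (error) {
      if (attempt === attempts) throw error
      await sleep(baseDelayMs * attempt)
    }
  }
}

// Example: await withRetry(() => downloadImage(url, targetPath))
```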
For detailed logging, you can modify the logger in `src/main.js`, or inspect `output/metadata.json` for per-operation results.
- Check the troubleshooting section
- Review the test files in `test/acceptance/` for usage examples
- Look at the Gherkin scenarios in `features/medium-scraper.feature`
- Open an issue if you find a bug
You can also use the scraper programmatically:
import { createMediumScraper } from './src/main.js'
const scraper = createMediumScraper()
// Authenticate
await scraper.auth.authenticate()
// Get summary
const summary = await scraper.getProfileSummary('https://medium.com/@username')
// Full scrape
const result = await scraper.scrapeProfile('https://medium.com/@username', {
maxScrollAttempts: 5,
})
console.log(`Scraped ${result.postsProcessed} posts successfully`)
// Incremental scrape
const incrementalResult = await scraper.scrapeProfile(
'https://medium.com/@username',
{
incremental: true,
}
)
console.log(`Updated ${incrementalResult.postsProcessed} posts`)

Known limitations:

- Requires Google OAuth authentication
- Rate limited to be respectful to Medium's servers
- Some private posts may not be accessible
- Custom Medium domains may have different layouts
- Images hosted on external CDNs may have download restrictions
This project is provided as-is for educational and personal backup purposes. Please respect Medium's Terms of Service and only scrape content you have permission to access.
To contribute:

- Follow the existing code style (functional programming, no semicolons)
- Write acceptance tests for new features
- Ensure all tests pass: `npm test && npm run quality:check`
- Follow the ATDD (Acceptance Test Driven Development) workflow outlined in `CLAUDE.md`