E-commerce Medallion Data Pipeline on AWS

A production-ready, serverless ETL pipeline for processing e-commerce transaction data on AWS

Features • Architecture • Getting Started • Usage • License

Overview

This project implements a fully automated, serverless ETL (Extract, Transform, Load) pipeline for processing e-commerce transaction data using AWS services. Built with infrastructure-as-code principles using Terraform, it demonstrates best practices for data engineering on AWS.

An EC2 instance generates realistic e-commerce order data every 5 minutes. An hourly AWS Glue job transforms the data and writes it to a partitioned data lake on S3, where it can be queried with SQL through Amazon Athena.

Features

Automated Data Generation - EC2 instance generates realistic e-commerce orders every 5 minutes
Scheduled ETL Processing - AWS Glue job runs hourly to transform and partition data
Partitioned Data Lake - Efficient data storage partitioned by region/year/month
SQL Analytics - Query your data using Amazon Athena with standard SQL
Infrastructure as Code - Complete AWS infrastructure provisioned via Terraform
Cost-Effective - Glue and Athena are serverless and billed per use, so the EC2 data generator is the only always-on compute

Architecture

The pipeline follows a modern data lakehouse architecture:

Data Generation Layer (EC2) - Simulates real-time e-commerce transactions
Storage Layer (S3) - Raw and processed data stored in partitioned structure
Processing Layer (AWS Glue) - Serverless Spark jobs for ETL transformations
Query Layer (Amazon Athena) - SQL interface for data analysis
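
As a rough illustration of how the storage and processing layers fit together, the sketch below derives an S3 key prefix for one order from its region and event timestamp. The Hive-style `key=value` layout, the `processed` base prefix, and the ISO 8601 timestamp format are assumptions made for this example; the actual Glue job may lay out data differently.

```python
from datetime import datetime


def partition_prefix(region: str, event_timestamp: str, base: str = "processed") -> str:
    """Return a Hive-style S3 key prefix for one order.

    The layout (region/year/month) follows the README; the exact
    key=value naming and the base prefix are assumptions.
    """
    # Assumes event_timestamp is an ISO 8601 string, e.g. "2024-03-15T10:30:00".
    ts = datetime.fromisoformat(event_timestamp)
    return f"{base}/region={region}/year={ts.year}/month={ts.month:02d}/"


if __name__ == "__main__":
    print(partition_prefix("EU", "2024-03-15T10:30:00"))
```

Partitioning on these three fields lets a query that filters by region and month read only the matching prefixes instead of the whole dataset.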

Data Schema

Each order is a JSON record with the following structure (the values shown are field types, not sample data):

{
  "metadata": {
    "source_system": "String",
    "ingestion_timestamp": "Timestamp",
    "schema_version": "String"
  },
  "payload": {
    "order_id": "Integer",
    "customer_id": "Integer",
    "product_id": "Integer",
    "amount": "String",
    "currency": "String",
    "event_timestamp": "String",
    "region": "String"
  }
}
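
To make the schema concrete, here is a minimal sketch that checks a record against the field types listed above. The sample values, the `validate_order` helper, and the mapping of "Timestamp" to an ISO 8601 string are hypothetical and are not taken from the repository.

```python
# Expected Python types for each payload field, mirroring the schema above.
PAYLOAD_TYPES = {
    "order_id": int,
    "customer_id": int,
    "product_id": int,
    "amount": str,
    "currency": str,
    "event_timestamp": str,
    "region": str,
}
METADATA_FIELDS = ("source_system", "ingestion_timestamp", "schema_version")


def validate_order(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    metadata = record.get("metadata", {})
    payload = record.get("payload", {})
    for field in METADATA_FIELDS:
        if field not in metadata:
            problems.append(f"metadata.{field} is missing")
    for field, expected in PAYLOAD_TYPES.items():
        if field not in payload:
            problems.append(f"payload.{field} is missing")
        elif type(payload[field]) is not expected:
            actual = type(payload[field]).__name__
            problems.append(f"payload.{field} is {actual}, expected {expected.__name__}")
    return problems


# Hypothetical sample record; all values are illustrative only.
sample_order = {
    "metadata": {
        "source_system": "order-generator",
        "ingestion_timestamp": "2024-03-15T10:30:05+00:00",
        "schema_version": "1.0",
    },
    "payload": {
        "order_id": 1001,
        "customer_id": 42,
        "product_id": 7,
        "amount": "19.99",
        "currency": "USD",
        "event_timestamp": "2024-03-15T10:30:00+00:00",
        "region": "EU",
    },
}

if __name__ == "__main__":
    print(validate_order(sample_order))
```

Note that `amount` is a string in the schema, so any numeric aggregation downstream needs an explicit cast.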

Prerequisites

Before you begin, ensure you have the following installed:

AWS CLI - For AWS authentication
Terraform - Infrastructure provisioning
Python 3.12 - For local development
An AWS account with appropriate permissions

Getting Started

1. Clone the Repository

git clone https://github.com/BALK-03/ecommerce-etl-aws.git
cd ecommerce-etl-aws

2. Set Up AWS Credentials

Create an IAM user and attach the following AWS managed policies:

AmazonS3FullAccess
IAMFullAccess
AWSGlueServiceRole
AmazonAthenaFullAccess
AmazonEC2FullAccess

Create an access key for the IAM user, then configure the AWS CLI:

aws configure

Enter your AWS Access Key ID, Secret Access Key, default region, and output format when prompted.

3. Configure Environment Variables

Create a .env file from the template:

cp .env.template .env

Edit .env and fill in your configuration.
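
Before deploying, it can help to confirm that no value in .env was left blank. The following is a minimal sketch of such a check; the variable names in the example are hypothetical, since the contents of .env.template are not shown here.

```python
from pathlib import Path


def empty_env_keys(env_text: str) -> list[str]:
    """Return the keys in KEY=VALUE text whose values are empty."""
    empty = []
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Treat quoted empty strings ('' or "") as empty too.
        if not value.strip().strip("'\""):
            empty.append(key.strip())
    return empty


if __name__ == "__main__":
    env_file = Path(".env")
    if env_file.exists():
        missing = empty_env_keys(env_file.read_text())
        print("Unset variables:", ", ".join(missing) if missing else "none")
    else:
        print(".env not found; run `cp .env.template .env` first")
```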

4. Deploy Infrastructure

Use the Makefile to deploy all AWS resources:

make deploy

This command will:

Initialize Terraform
Plan the infrastructure changes
Apply the configuration to create all AWS resources
Set up the EC2 data generator
Configure the Glue ETL job
Create Athena database and tables
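
The first three steps correspond to the standard Terraform workflow, and the remaining resources would be created when the plan is applied. The sketch below shows one way that sequence could be scripted. It assumes the Terraform configuration lives in the infra directory and uses a plan file named tfplan; the repository's Makefile may run different commands.

```python
import subprocess


def terraform_commands(infra_dir: str = "infra") -> list[list[str]]:
    """Build the init/plan/apply command sequence for a Terraform directory."""
    # -chdir makes Terraform run as if started inside infra_dir,
    # so the plan file is written to and read from that directory.
    chdir = f"-chdir={infra_dir}"
    return [
        ["terraform", chdir, "init"],
        ["terraform", chdir, "plan", "-out=tfplan"],
        ["terraform", chdir, "apply", "tfplan"],
    ]


def deploy(infra_dir: str = "infra") -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in terraform_commands(infra_dir):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands rather than running them, so this is safe to try.
    for cmd in terraform_commands():
        print(" ".join(cmd))
```

Applying a saved plan file means the changes that were reviewed in the plan step are exactly the ones that get applied.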

5. Verify Deployment

After deployment completes, confirm that each stage is working:

Check EC2 Instance - Verify data generation is running
Monitor S3 Buckets - Confirm data is being written every 5 minutes
View Glue Jobs - Check the ETL job in AWS Glue console
Test Athena - Run a sample query to verify the pipeline
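
For the S3 check, one simple heuristic is to compare the last-modified time of the newest object with the 5-minute generation interval. The sketch below assumes you have already read that timestamp from the S3 console or the AWS CLI; the function name and the tolerance of two intervals are illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def is_generator_healthy(
    last_modified: datetime,
    now: datetime,
    interval: timedelta = timedelta(minutes=5),
    tolerance: int = 2,
) -> bool:
    """Return True if the newest object is at most `tolerance` intervals old."""
    return now - last_modified <= interval * tolerance


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical timestamp: pretend the newest object arrived 4 minutes ago.
    newest = now - timedelta(minutes=4)
    print(is_generator_healthy(newest, now))
```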

Usage

Running Athena Queries

Create your SQL queries in the queries/ directory. Execute queries using the Makefile:

make query-athena
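
As an example of the kind of query that benefits from the partitioned layout, the sketch below builds a revenue query that filters on the partition columns. The table name, the assumption that region, year, and month are exposed as partition columns (with numeric year and month), and the cast of amount from a string are all hypothetical; adjust them to match the tables created by the deployment.

```python
def monthly_revenue_query(table: str, region: str, year: int, month: int) -> str:
    """Build a SQL query summing order amounts for one region and month."""
    # Basic guard, since the values are interpolated into the SQL text.
    if not table.replace("_", "").isalnum() or not region.isalnum():
        raise ValueError("table and region must be simple identifiers")
    return (
        "SELECT currency, SUM(CAST(amount AS DECIMAL(18, 2))) AS revenue\n"
        f"FROM {table}\n"
        # Filtering on partition columns limits the data Athena has to scan.
        f"WHERE region = '{region}' AND year = {int(year)} AND month = {int(month)}\n"
        "GROUP BY currency"
    )


if __name__ == "__main__":
    # "orders" is a placeholder table name.
    print(monthly_revenue_query("orders", "EU", 2024, 3))
```

The printed statement could be saved as a .sql file in the queries/ directory.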

License

This project is licensed under the MIT License - see the LICENSE file for details.

