A production-ready, serverless ETL pipeline for processing e-commerce transaction data on AWS
Features • Architecture • Getting Started • Usage • License
This project implements a fully automated, serverless ETL (Extract, Transform, Load) pipeline for processing e-commerce transaction data using AWS services. Built with infrastructure-as-code principles using Terraform, it demonstrates best practices for data engineering on AWS.
The pipeline continuously generates realistic e-commerce order data, processes it using AWS Glue, stores it in a partitioned data lake on S3, and enables SQL analytics through Amazon Athena.
## Features

- Automated Data Generation - EC2 instance generates realistic e-commerce orders every 5 minutes
- Scheduled ETL Processing - AWS Glue job runs hourly to transform and partition data
- Partitioned Data Lake - Efficient data storage partitioned by region/year/month
- SQL Analytics - Query your data using Amazon Athena with standard SQL
- Infrastructure as Code - Complete AWS infrastructure provisioned via Terraform
- Cost-Effective - Serverless architecture minimizes operational costs
## Architecture

The pipeline follows a modern data lakehouse architecture:
- Data Generation Layer (EC2) - Simulates real-time e-commerce transactions
- Storage Layer (S3) - Raw and processed data stored in partitioned structure
- Processing Layer (AWS Glue) - Serverless Spark jobs for ETL transformations
- Query Layer (Amazon Athena) - SQL interface for data analysis
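Processed data is partitioned by region/year/month. Assuming a Hive-style partition layout (the actual prefixes are defined by the Terraform configuration, so the names below are illustrative), a processed object key could be built like this:

```python
from datetime import datetime, timezone

def partition_key(region: str, ts: datetime, filename: str) -> str:
    """Build a Hive-style partition path: region/year/month (layout assumed)."""
    return (
        f"processed/region={region}/"
        f"year={ts.year}/month={ts.month:02d}/{filename}"
    )

print(partition_key("eu-west", datetime(2024, 5, 17, tzinfo=timezone.utc), "part-0000.parquet"))
# → processed/region=eu-west/year=2024/month=05/part-0000.parquet
```

Keys in this shape let Athena prune partitions when a query filters on `region`, `year`, or `month`, which is what makes the partitioned layout cost-effective to scan.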
The pipeline processes e-commerce orders with the following structure:
```json
{
  "metadata": {
    "source_system": "String",
    "ingestion_timestamp": "Timestamp",
    "schema_version": "String"
  },
  "payload": {
    "order_id": "Integer",
    "customer_id": "Integer",
    "product_id": "Integer",
    "amount": "String",
    "currency": "String",
    "event_timestamp": "String",
    "region": "String"
  }
}
```
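A record matching this schema can be produced with a few lines of Python. This is a minimal sketch of what the EC2 generator might emit; the field values, currencies, and region names here are illustrative assumptions, not the project's actual generator code:

```python
import json
import random
from datetime import datetime, timezone

def generate_order() -> dict:
    """Build one synthetic order matching the schema above (values are illustrative)."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "metadata": {
            "source_system": "order-generator",  # hypothetical source name
            "ingestion_timestamp": now,
            "schema_version": "1.0",
        },
        "payload": {
            "order_id": random.randint(1, 1_000_000),
            "customer_id": random.randint(1, 10_000),
            "product_id": random.randint(1, 500),
            # note: amount is a String in the schema, e.g. "19.99"
            "amount": f"{random.uniform(5, 500):.2f}",
            "currency": random.choice(["USD", "EUR", "GBP"]),
            "event_timestamp": now,
            "region": random.choice(["us-east", "eu-west", "ap-south"]),
        },
    }

if __name__ == "__main__":
    print(json.dumps(generate_order(), indent=2))
```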
## Getting Started

Before you begin, ensure you have the following installed:
- AWS CLI - For AWS authentication
- Terraform - Infrastructure provisioning
- Python 3.12 - For local development
- An AWS account with appropriate permissions
```bash
git clone https://github.com/BALK-03/ecommerce-etl-aws.git
cd ecommerce-etl-aws
```

Create an IAM user with the following permissions:

- AmazonS3FullAccess
- IAMFullAccess
- AWSGlueServiceRole
- AmazonAthenaFullAccess
- AmazonEC2FullAccess
Create an Access Key for the IAM user, then configure AWS CLI:
```bash
aws configure
```

Enter your AWS Access Key ID, Secret Access Key, default region, and output format when prompted.
Create a .env file from the template:
```bash
cp .env.template .env
```

Edit .env and fill in your configuration.
Use the Makefile to deploy all AWS resources:
```bash
make deploy
```

This command will:
- Initialize Terraform
- Plan the infrastructure changes
- Apply the configuration to create all AWS resources
- Set up the EC2 data generator
- Configure the Glue ETL job
- Create Athena database and tables
After deployment completes:
- Check EC2 Instance - Verify data generation is running
- Monitor S3 Buckets - Confirm data is being written every 5 minutes
- View Glue Jobs - Check the ETL job in AWS Glue console
- Test Athena - Run a sample query to verify the pipeline
## Usage

Create your SQL queries in the queries/ directory. Execute queries using the Makefile:
```bash
make query-athena
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.
