This project demonstrates a real-time stock market data streaming pipeline using Kafka, AWS EC2, S3, Glue, and other AWS services. The architecture is designed to simulate stock market data, stream it via Kafka, store it in S3, catalog it using AWS Glue, and query it with Amazon Athena.
- AWS Account
- EC2 Instance with necessary permissions
- Apache Kafka
- Python 3.x
- AWS CLI configured
- boto3 library
- Dataset for stock market simulation
- Download and extract Kafka (note: the download path version must match the tarball version):

  ```bash
  wget https://downloads.apache.org/kafka/3.7.0/kafka_2.12-3.7.0.tgz
  tar -xvf kafka_2.12-3.7.0.tgz
  ```
- Install Java (if not already installed):

  ```bash
  sudo yum install java-17-openjdk -y
  java -version
  ```
- Start ZooKeeper:

  ```bash
  cd kafka_2.12-3.7.0
  bin/zookeeper-server-start.sh config/zookeeper.properties
  ```
- Start the Kafka server (in a new terminal):

  ```bash
  export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M"
  cd kafka_2.12-3.7.0
  bin/kafka-server-start.sh config/server.properties
  ```
- Configure Kafka to advertise the public IP of your EC2 instance by editing `config/server.properties`:

  ```bash
  sudo nano config/server.properties
  # Set advertised.listeners to the public IP of the EC2 instance, e.g.:
  # advertised.listeners=PLAINTEXT://<Public_IP_of_EC2_Instance>:9092
  ```
- Create a Kafka topic:

  ```bash
  bin/kafka-topics.sh --create --topic demo_testing2 --bootstrap-server <Public_IP_of_EC2_Instance>:9092 --replication-factor 1 --partitions 1
  ```
- Start a Kafka producer:

  ```bash
  bin/kafka-console-producer.sh --topic demo_testing2 --bootstrap-server <Public_IP_of_EC2_Instance>:9092
  ```
- Start a Kafka consumer (in a new terminal):

  ```bash
  bin/kafka-console-consumer.sh --topic demo_testing2 --bootstrap-server <Public_IP_of_EC2_Instance>:9092
  ```
Use the provided Python scripts to simulate stock market data and produce messages to the Kafka topic.
- KafkaProducer.ipynb: Contains the producer logic for streaming stock market data.
- KafkaConsumer.ipynb: Contains the consumer logic for reading streamed data.
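For reference, here is a minimal sketch of what the producer logic might look like. It assumes the `kafka-python` library and a local CSV dataset; the file name `stock_data.csv`, its columns, and the one-second interval are illustrative placeholders, not fixed by the project:

```python
import json
import time

import pandas as pd
from kafka import KafkaProducer

# default=str covers non-JSON-native types such as NumPy numbers from pandas
producer = KafkaProducer(
    bootstrap_servers=["<Public_IP_of_EC2_Instance>:9092"],
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)

df = pd.read_csv("stock_data.csv")  # hypothetical dataset file

while True:
    # Simulate a live feed by sending one random row per second
    record = df.sample(1).to_dict(orient="records")[0]
    producer.send("demo_testing2", value=record)
    time.sleep(1)
```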
Configure your AWS S3 bucket and use Boto3 to store the Kafka data:
- Ensure your EC2 instance has the necessary IAM role with S3 permissions.
- Use the Boto3 library in your consumer script to upload data to S3.
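A minimal sketch of the consumer-side upload, assuming `kafka-python` and `boto3`; the bucket name and object-key pattern are placeholders for your own values:

```python
import json

import boto3
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "demo_testing2",
    bootstrap_servers=["<Public_IP_of_EC2_Instance>:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

s3 = boto3.client("s3")

for count, message in enumerate(consumer):
    # Write each record as its own JSON object so Glue/Athena can read it
    s3.put_object(
        Bucket="kafka-stock-market-demo",  # hypothetical bucket name
        Key=f"stock_market_{count}.json",
        Body=json.dumps(message.value),
    )
```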
- Create a Glue Crawler:
  - Set the S3 bucket as the data source.
  - Run the crawler to catalog the data.
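The crawler can also be created programmatically instead of through the console. A minimal `boto3` sketch, where the crawler name, IAM role, catalog database, and bucket path are all placeholders:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler pointed at the S3 bucket holding the Kafka output
glue.create_crawler(
    Name="stock-market-crawler",            # hypothetical crawler name
    Role="AWSGlueServiceRole-StockMarket",  # hypothetical IAM role with S3 access
    DatabaseName="stock_market_db",         # hypothetical catalog database
    Targets={"S3Targets": [{"Path": "s3://kafka-stock-market-demo/"}]},
)

glue.start_crawler(Name="stock-market-crawler")
```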
- Use the AWS Glue Data Catalog to query and analyze the data with Amazon Athena.
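A minimal `boto3` sketch of issuing an Athena query against the crawled table; the database, table, and results location are placeholders (the actual table name is derived by the crawler from your S3 path):

```python
import boto3

athena = boto3.client("athena")

# Run a simple query against the table cataloged by the Glue crawler
response = athena.start_query_execution(
    QueryString="SELECT * FROM kafka_stock_market_demo LIMIT 10;",  # hypothetical table
    QueryExecutionContext={"Database": "stock_market_db"},          # hypothetical database
    ResultConfiguration={
        "OutputLocation": "s3://kafka-stock-market-demo/athena-results/"
    },
)
print(response["QueryExecutionId"])
```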
- Start ZooKeeper and the Kafka broker as described in the setup instructions.
- Run the Kafka producer script to simulate and stream stock market data.
- Run the Kafka consumer script to read the streamed data and upload it to S3.
- Use AWS Glue and Athena to catalog and query the data.
This project provides a scalable and efficient pipeline for real-time stock market data streaming and analysis using Kafka and AWS services. The architecture leverages distributed systems and cloud computing to handle large volumes of data.
For any queries, please reach out to Kartik Pandit at kartikpandit712@gmail.com.
