This repository contains Terraform configurations for a CDC data pipeline from an RDS MySQL database to BigQuery using Confluent Cloud. It provides infrastructure as code (IaC) to manage resources and deploy applications on AWS, GCP, and Confluent Cloud over private networking, with stream governance.
- AWS: RDS, RDS Proxy, PrivateLink Service, Network Load balancer, Secrets Manager, EC2, VPC Endpoint, Route53
- GCP: BigQuery, IAM
- Confluent: PrivateLink Attachment (PLATT) networking on AWS, BigQuery & MySQL CDC V2 connectors, Schema Registry, Enterprise Kafka cluster
- Terraform: Install Terraform on your local machine by following the installation guide.
- Cloud Provider CLIs (AWS, GCP & Confluent Cloud):
  - For AWS, refer to the AWS CLI setup.
  - For GCP, refer to the GCP CLI setup.
  - For Confluent Cloud, refer to the Confluent CLI setup.
- Cloud Provider Access: (AWS, GCP & Confluent Cloud)
- AWS: ACCESS_KEY & SECRET_KEY with permissions for network administration, RDS, KMS, Secrets Manager, and EC2
- GCP: Service Account - BigQuery writer, IAM Editor
- Confluent: Cloud API KEY & SECRET
- Linux tools: mysql client, git, VS Code
- Terraform Cloud/State backend (optional): If using a remote backend for storing the Terraform state, ensure that it is set up (e.g., AWS S3, HashiCorp Consul, etc.).
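If you opt for a remote state backend, an S3 backend can be declared with a block like the following sketch. The bucket, key, and region values are illustrative assumptions, not values from this repository:

```hcl
# backend.tf — illustrative only; adjust bucket/key/region to your environment
terraform {
  backend "s3" {
    bucket = "my-terraform-state-bucket"              # assumption: replace with your bucket
    key    = "migration-pipeline/terraform.tfstate"   # assumption: any key path works
    region = "us-east-1"
  }
}
```

Run `terraform init` again after adding or changing a backend so Terraform can migrate the state.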
- Create an EC2 instance in the data tenant VPC, in a public subnet with a public IP, accessible over SSH:
```shell
aws configure # Provide AWS access key, secret key & default region

# List the VPCs
aws ec2 describe-vpcs --query "Vpcs[].VpcId" --output text

# Select the RDS tenant VPC and look for public subnets
aws ec2 describe-subnets \
  --filters Name=vpc-id,Values=<your_vpc_id> Name=map-public-ip-on-launch,Values=true \
  --query "Subnets[].SubnetId" --output text

# Create a security group for the Confluent bastion in the data tenant VPC
aws ec2 create-security-group \
  --group-name ConfluentBastionSSHAccessGroup \
  --description "Security group for SSH access" \
  --vpc-id <your_vpc_id>
aws ec2 authorize-security-group-ingress \
  --group-id <security_group_id> \
  --protocol tcp --port 22 --cidr 0.0.0.0/0
# New security groups allow all egress by default, so no egress rule is needed

# Create an SSH key pair for the bastion host
aws ec2 create-key-pair --key-name ConfluentBastion \
  --query 'KeyMaterial' --output text > ConfluentBastion.pem
chmod 400 ConfluentBastion.pem

# Create the EC2 instance in the data tenant VPC and public subnet.
# Example Ubuntu AMI; ensure it's the correct one for your region:
# https://cloud-images.ubuntu.com/locator/ec2/
aws ec2 run-instances \
  --image-id ami-05f157b283f1f33b9 \
  --count 1 \
  --instance-type t3.large \
  --key-name ConfluentBastion \
  --security-group-ids <security_group_id> \
  --subnet-id <subnet_id> \
  --associate-public-ip-address # Ensure the EC2 instance gets a public IP
```
- In VS Code, connect to the EC2 bastion instance over SSH:
```
# ~/.ssh/config
Host AWS_CONFLUENT_BASTION
  HostName <EC2-PUBLIC-ENDPOINT>
  User ubuntu
  IdentityFile <ABSOLUTE-PATH-TO-ConfluentBastion.pem>
```
- Once inside VS Code on the bastion, clone the repository:
```shell
git clone https://github.com/flashiam12/migration-pipeline.git
cd migration-pipeline
code .

# Install the gcloud, confluent, aws and terraform CLIs, then authenticate
aws configure
gcloud auth application-default login
confluent login
terraform init
```
- Set up the Terraform variables:
```shell
cp terraform.tfvars.sample terraform.tfvars
# Provide all the required values to the variables
```

Providers:

| Name | Version |
| --- | --- |
| aws | 5.88.0 |
| confluent | 2.18.0 |
| dns | 3.4.2 |
| google | 6.22.0 |
| null | 3.2.3 |

No modules.
Inputs:

| Name | Description | Type | Default | Required |
| --- | --- | --- | --- | --- |
| aws_db_subnet_ids | n/a | list(string) | n/a | yes |
| aws_db_subnet_zones | n/a | list(string) | n/a | yes |
| aws_db_vpc_id | n/a | string | n/a | yes |
| aws_rds_mysql_db | n/a | string | n/a | yes |
| aws_rds_mysql_instance_name | n/a | string | n/a | yes |
| aws_rds_mysql_password | n/a | string | n/a | yes |
| aws_rds_mysql_tables | n/a | list(string) | n/a | yes |
| aws_rds_mysql_user | n/a | string | n/a | yes |
| aws_region | n/a | string | n/a | yes |
| cc_cloud_api_key | n/a | string | n/a | yes |
| cc_cloud_api_secret | n/a | string | n/a | yes |
| cc_create_network | n/a | bool | true | no |
| cc_create_ops_service_account | n/a | bool | true | no |
| cc_env | n/a | string | n/a | yes |
| cc_kafka_cluster_name | n/a | string | n/a | yes |
| cc_kafka_cluster_type | n/a | string | "enterprise" | no |
| cc_kafka_create_cluster | n/a | bool | true | no |
| cc_network_name | n/a | string | n/a | yes |
| gcp_bigquery_dataset | n/a | string | n/a | yes |
| gcp_bq_service_account_json_file | n/a | string | n/a | yes |
| gcp_bq_service_account_name | n/a | string | n/a | yes |
| gcp_project_id | n/a | string | n/a | yes |

No outputs.
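A filled-in `terraform.tfvars` might look like the following sketch. Every value is a made-up placeholder for illustration, not a real credential, ID, or value from this repository:

```hcl
# terraform.tfvars — placeholder values only
aws_region                       = "us-east-1"
aws_db_vpc_id                    = "vpc-0123456789abcdef0"
aws_db_subnet_ids                = ["subnet-01234567", "subnet-89abcdef"]
aws_db_subnet_zones              = ["us-east-1a", "us-east-1b"]
aws_rds_mysql_instance_name      = "cdc-source-db"
aws_rds_mysql_db                 = "inventory"
aws_rds_mysql_user               = "admin"
aws_rds_mysql_password           = "change-me"
aws_rds_mysql_tables             = ["orders", "customers"]

cc_cloud_api_key                 = "CC_API_KEY"
cc_cloud_api_secret              = "CC_API_SECRET"
cc_env                           = "cdc-demo"
cc_network_name                  = "aws-platt-network"
cc_kafka_cluster_name            = "cdc-enterprise-cluster"
cc_kafka_cluster_type            = "enterprise"
cc_kafka_create_cluster          = true
cc_create_network                = true
cc_create_ops_service_account    = true

gcp_project_id                   = "my-gcp-project"
gcp_bigquery_dataset             = "cdc_dataset"
gcp_bq_service_account_name      = "bq-cdc-writer"
gcp_bq_service_account_json_file = "./bq-sa.json"
```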
```shell
# Enabling RDS with the proper binlog config for CDC on MySQL
aws rds create-db-parameter-group \
  --db-parameter-group-name confluent-mysql8 \
  --db-parameter-group-family mysql8.0 \
  --description "Parameter group binlog setting for cdc"

aws rds modify-db-parameter-group \
  --db-parameter-group-name confluent-mysql8 \
  --parameters "ParameterName=binlog_format,ParameterValue=ROW,ApplyMethod=immediate"

aws rds modify-db-instance \
  --db-instance-identifier <YOUR_DB_INSTANCE_IDENTIFIER> \
  --db-parameter-group-name confluent-mysql8 \
  --apply-immediately

aws rds reboot-db-instance --db-instance-identifier <YOUR_DB_INSTANCE_IDENTIFIER>
```
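After the reboot, you can confirm that the binlog settings took effect using the mysql client. The endpoint and credentials below are placeholders for your own RDS values:

```shell
# Placeholders: substitute your RDS endpoint and credentials
mysql -h <rds-endpoint> -u <user> -p \
  -e "SELECT @@binlog_format, @@log_bin;"
# binlog_format should report ROW once the new parameter group is active
```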
```shell
# Running the terraform plan and applying it
terraform init
terraform plan
terraform apply
```

This document outlines the steps to install and configure an NGINX proxy to route traffic to your Confluent Cloud cluster. The setup uses Server Name Indication (SNI) to direct traffic to the appropriate backends on ports 443 and 9092.
- A Virtual Machine (VM) in your VPC or VNet that is connected to Confluent Cloud.
- Access to the Confluent Cloud Console.
- Basic Linux command-line knowledge.
- Provision a VM:
  - Create a VM in your VPC or VNet that has network connectivity to your Confluent Cloud environment. Default VM properties are sufficient.
- Install NGINX:
  - Connect to your VM via SSH.
  - For Ubuntu/Debian:

    ```shell
    sudo apt update
    sudo apt install nginx
    ```

  - For RedHat:

    ```shell
    sudo yum install nginx
    ```
- Test NGINX Configuration:
  - Verify the NGINX installation and configuration syntax:

    ```shell
    nginx -t
    ```
- Enable `ngx_stream_module` (if needed):
  - If you encounter an error related to `ngx_stream_module.so`, locate the module. Common locations are `/usr/lib/nginx/modules` and `/usr/lib64/nginx/modules`.
  - Add the following line to the top of `/etc/nginx/nginx.conf` (adjust the path if needed):

    ```
    load_module /usr/lib/nginx/modules/ngx_stream_module.so;
    ```

  - Re-test the configuration:

    ```shell
    nginx -t
    ```
- Configure NGINX for SNI Routing:
  - Replace the contents of `/etc/nginx/nginx.conf` with the following:

    ```
    events {}
    stream {
      map $ssl_preread_server_name $targetBackend {
        default $ssl_preread_server_name;
      }

      server {
        listen 9092;
        proxy_connect_timeout 1s;
        proxy_timeout 7200s;
        resolver 127.0.0.53;
        proxy_pass $targetBackend:9092;
        ssl_preread on;
      }

      server {
        listen 443;
        proxy_connect_timeout 1s;
        proxy_timeout 7200s;
        resolver 127.0.0.53;
        proxy_pass $targetBackend:443;
        ssl_preread on;
      }

      log_format stream_routing '[$time_local] remote address $remote_addr '
        'with SNI name "$ssl_preread_server_name" '
        'proxied to "$upstream_addr" '
        '$protocol $status $bytes_sent $bytes_received '
        '$session_time';
      access_log /var/log/nginx/stream-access.log stream_routing;
    }
    ```

  - Important: Do not replace `$targetBackend`. This variable is used for SNI routing: `ssl_preread` extracts the server name from the TLS ClientHello, and the `map` block forwards the connection to that same hostname.
- Verify DNS Resolver:
  - Test the resolver configuration, replacing `<ConfluentCloud_BootstrapHostname>` with your Confluent Cloud bootstrap hostname:

    ```shell
    nslookup <ConfluentCloud_BootstrapHostname> 127.0.0.53
    ```

  - Check `/var/log/nginx/error.log` for resolver errors.
  - If DNS resolution fails, adjust the `resolver` directive in both `server` blocks:
    - AWS: `resolver 169.254.169.253;`
    - Azure: `resolver 168.63.129.16;`
    - Google Cloud: `resolver 169.254.169.254;`
- Restart NGINX:
  - Apply the changes:

    ```shell
    sudo systemctl restart nginx
    ```
- Verify NGINX Status:
  - Ensure NGINX is running:

    ```shell
    sudo systemctl status nginx
    ```
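Once NGINX is running, you can check from the proxy VM itself that SNI-based forwarding works by opening a TLS connection through the proxy with an explicit server name. The hostname below is a placeholder for your bootstrap endpoint:

```shell
# Placeholder: substitute your Confluent Cloud bootstrap hostname
openssl s_client -connect localhost:9092 \
  -servername <ConfluentCloud_BootstrapHostname> </dev/null
# A successful handshake prints the Confluent Cloud certificate chain;
# a connect error suggests a resolver or security-group problem
```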
- Configure DNS Resolution:
  - On your local machine (not the proxy VM), update your DNS configuration (e.g., `/etc/hosts`) to route Confluent Cloud traffic through the proxy.
  - Add lines similar to the following, replacing the placeholders with your VM's public IP and your Confluent Cloud endpoints:

    ```
    <Public IP Address of VM instance> <Kafka-REST-Endpoint>
    <Public IP Address of VM instance> <Flink-private-endpoint>
    ```

  - Retrieve the `<Kafka-REST-Endpoint>` from the Confluent Cloud Console. The Kafka bootstrap and REST endpoints often share the same hostname, differing only in port number.
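To avoid hand-editing mistakes, the `/etc/hosts` lines can be generated with a small shell snippet and reviewed before being appended. The IP address and hostnames below are made-up placeholders:

```shell
#!/bin/sh
# Generate /etc/hosts entries that point Confluent endpoints at the proxy VM.
# PROXY_IP and ENDPOINTS are illustrative placeholders — substitute your own.
PROXY_IP="203.0.113.10"
ENDPOINTS="lkc-abc123.us-east-1.aws.confluent.cloud flink.us-east-1.aws.private.confluent.cloud"

# Emit one "<ip> <hostname>" line per endpoint into a review file
for host in $ENDPOINTS; do
  printf '%s %s\n' "$PROXY_IP" "$host"
done > hosts.snippet

cat hosts.snippet
# After reviewing, append with:
#   sudo sh -c 'cat hosts.snippet >> /etc/hosts'
```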
- Ensure your VM's security groups allow inbound traffic on ports 443 and 9092.
- The `proxy_timeout` is set to 7200 seconds (2 hours). Adjust as needed.
- This setup assumes your Confluent Cloud cluster uses the standard ports 443 and 9092.
- If you are using a firewall on the VM, ensure it allows connections to the Confluent Cloud cluster.
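The required inbound rules can be added with the AWS CLI, for example. The security group ID and client CIDR below are placeholders for your own values:

```shell
# Placeholders: substitute your proxy VM's security group ID and client CIDR
for port in 443 9092; do
  aws ec2 authorize-security-group-ingress \
    --group-id <vm_security_group_id> \
    --protocol tcp --port "$port" --cidr <your_client_cidr>
done
```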
```shell
# Tear down the provisioned infrastructure
terraform destroy
```