- The startup Sparkify wants to move its user base and song database to the cloud.
- An Extract, Transform, Load (ETL) pipeline extracts JSON data files from Amazon Simple Storage Service (S3), stages them in Amazon Redshift, and loads them into a star-schema relational database on the Redshift data warehouse.

Data for songs and user activity reside in S3 as JSON files:
- Song data: s3://udacity-dend/song_data
- Log data: s3://udacity-dend/log_data
- Log data json path: s3://udacity-dend/log_json_path.json
- The song dataset is a subset of real data from the Million Song Dataset.
- Each file is in JSON format and contains metadata about a song and the artist of that song.
- The files are partitioned by the first three letters of each song's track ID.
- For example, here are the file paths to two files in the song dataset:
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
- Example of a single song file, TRAABJL12903CDCF1A.json:
{"num_songs": 1,
"artist_id": "ARJIE2Y1187B994AB7",
"artist_latitude": null,
"artist_longitude": null,
"artist_location": "",
"artist_name": "Line Renaud",
"song_id": "SOUPIRU12A6D4FA1E1",
"title": "Der Kleine Dompfaff",
"duration": 152.92036,
"year": 0}
- The log files are in JSON format, generated by an event simulator based on the songs in the song dataset.
- They simulate activity logs from a music streaming app based on specified configurations.
- The log files are partitioned by year and month.
- For example, here are the file paths to two files in the log dataset:
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
- Example of a single log file, 2018-11-12-events.json:
{"artist":null,
"auth":"Logged In",
"firstName":"Celeste",
"gender":"F",
"itemInSession":0,
"lastName":"Williams",
"length":null,
"level":"free",
"location":"Klamath Falls, OR",
"method":"GET",
"page":"Home",
"registration":1541077528796.0,
"sessionId":438,
"song":null,
"status":200,
"ts":1541990217796,
"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/37.0.2062.103 Safari\/537.36\"",
"userId":"53"}
Staging table for the song data:
| Column | Type |
|---|---|
| num_songs | varchar |
| artist_id | varchar |
| latitude | float |
| longitude | float |
| location | varchar |
| name | varchar |
| song_id | varchar |
| title | varchar |
| duration | float |
| year | int |
Staging table for the log data:
| Column | Type |
|---|---|
| artist | varchar |
| auth | varchar |
| first_name | varchar |
| gender | char(1) |
| item_in_session | int |
| last_name | varchar |
| length | float |
| level | varchar |
| location | varchar |
| method | varchar |
| page | varchar |
| registration | varchar |
| session_id | int |
| song | varchar |
| status | int |
| ts | timestamp |
| user_agent | varchar |
| user_id | int |
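For illustration, the staging load could use Redshift COPY statements like the sketch below. The table names staging_songs and staging_events are assumptions, not taken from this README; the S3 paths are the ones listed above, and the IAM role ARN comes from dwh.cfg. Note that FORMAT AS JSON 'auto' matches columns to JSON fields by name, so with the renamed columns above (e.g. latitude instead of artist_latitude) a JSONPaths file would be needed for the song data as well:

```python
# Hypothetical COPY statements for the staging load; the table names
# staging_songs and staging_events are illustrative assumptions.
# {arn} is filled in from the [IAM_ROLE] section of dwh.cfg.
# REGION is the source bucket's region (us-west-2 for udacity-dend).

staging_songs_copy = """
    COPY staging_songs
    FROM 's3://udacity-dend/song_data'
    IAM_ROLE '{arn}'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
"""

# The JSONPaths file maps fields such as $.firstName to the staging
# columns in order; TIMEFORMAT converts the epoch-millisecond ts
# values into the timestamp column of the staging table.
staging_events_copy = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE '{arn}'
    FORMAT AS JSON 's3://udacity-dend/log_json_path.json'
    TIMEFORMAT 'epochmillisecs'
    REGION 'us-west-2';
"""
```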
- Star schema with one fact table (songplays) and four dimension tables (users, songs, artists and time).
songplays (fact table): records in the log data associated with song plays, i.e. records with page NextSong.
| Column | Type | Nullable |
|---|---|---|
| songplay_id (PK) | int IDENTITY(0,1) | NOT NULL |
| start_time | timestamp | NOT NULL |
| user_id | int | NOT NULL |
| level | varchar | |
| song_id | varchar | |
| artist_id | varchar | |
| session_id | int | |
| location | varchar | |
| user_agent | varchar | |
- Distribution style: KEY on start_time
- Sorting key: start_time
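A CREATE TABLE statement matching the songplays definition above might look like the following sketch (not the project's actual DDL). Redshift has no SERIAL type, so the auto-incrementing key uses IDENTITY, and the distribution and sorting keys are declared with DISTKEY and SORTKEY:

```python
# Sketch of the DDL implied by the songplays table above.
# IDENTITY(0, 1) is Redshift's auto-incrementing column; DISTKEY and
# SORTKEY declare the distribution and sorting keys listed above.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id  int IDENTITY(0, 1)  NOT NULL,
        start_time   timestamp           NOT NULL DISTKEY SORTKEY,
        user_id      int                 NOT NULL,
        level        varchar,
        song_id      varchar,
        artist_id    varchar,
        session_id   int,
        location     varchar,
        user_agent   varchar,
        PRIMARY KEY (songplay_id)
    );
"""
```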
users (dimension table): users in the app.
| Column | Type | Nullable |
|---|---|---|
| user_id (PK) | int | NOT NULL |
| first_name | varchar | |
| last_name | varchar | |
| gender | varchar | |
| level | varchar | |
- Sorting key: user_id
songs (dimension table): songs in the music database.
| Column | Type | Nullable |
|---|---|---|
| song_id (PK) | varchar | NOT NULL |
| title | varchar | NOT NULL |
| artist_id | varchar | NOT NULL |
| year | int | |
| duration | numeric | NOT NULL |
- Sorting key: song_id
artists (dimension table): artists in the music database.
| Column | Type | Nullable |
|---|---|---|
| artist_id (PK) | varchar | NOT NULL |
| name | varchar | |
| location | varchar | |
| latitude | float | |
| longitude | float | |
- Sorting key: artist_id
time (dimension table): timestamps of records in songplays broken down into specific units.
| Column | Type | Nullable |
|---|---|---|
| start_time (PK) | timestamp | NOT NULL |
| hour | int | |
| day | int | |
| week | int | |
| month | int | |
| year | int | |
| weekday | int | |
- Distribution style: KEY on start_time
- Sorting key: start_time
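As an example of the transform step, the time dimension can be filled straight from the staging timestamps with EXTRACT. This is a sketch: staging_events is the assumed staging table name, and ts is assumed to already be a timestamp (see the staging table above, loaded with TIMEFORMAT 'epochmillisecs'):

```python
# Sketch of the time-dimension load; staging_events is an assumed
# table name. Only NextSong events correspond to song plays, hence
# the filter, and DISTINCT avoids duplicate start_time rows.
time_table_insert = """
    INSERT INTO time (start_time, hour, day, week, month, year, weekday)
    SELECT DISTINCT
        ts,
        EXTRACT(hour    FROM ts),
        EXTRACT(day     FROM ts),
        EXTRACT(week    FROM ts),
        EXTRACT(month   FROM ts),
        EXTRACT(year    FROM ts),
        EXTRACT(weekday FROM ts)
    FROM staging_events
    WHERE page = 'NextSong';
"""
```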
- A shared module contains all the SQL queries used by the two scripts below.
- create_tables.py drops and creates the tables. Run it to reset the tables before each run of the ETL script.
- etl.py reads and processes the files from S3 and loads the data into the tables.
- dwh.cfg is the configuration file for Redshift, IAM and S3:
[CLUSTER]
HOST=''
DB_NAME=''
DB_USER=''
DB_PASSWORD=''
DB_PORT=
[IAM_ROLE]
ARN=''
[S3]
LOG_DATA='s3://udacity-dend/log_data'
LOG_JSONPATH='s3://udacity-dend/log_json_path.json'
SONG_DATA='s3://udacity-dend/song_data'
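A minimal sketch of how the scripts could read dwh.cfg and connect to the cluster, assuming the psycopg2 driver (which is not named in this README):

```python
import configparser

import psycopg2  # assumed driver; any Postgres-compatible client works

# Read cluster credentials from the [CLUSTER] section of dwh.cfg.
config = configparser.ConfigParser()
config.read("dwh.cfg")
cluster = config["CLUSTER"]

# strip("'") removes the quotes used in the dwh.cfg template above.
conn = psycopg2.connect(
    host=cluster["HOST"].strip("'"),
    dbname=cluster["DB_NAME"].strip("'"),
    user=cluster["DB_USER"].strip("'"),
    password=cluster["DB_PASSWORD"].strip("'"),
    port=cluster["DB_PORT"],
)
cur = conn.cursor()
```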
- Launch a Redshift cluster.
- Set up the dwh.cfg file.
- Open a terminal.
- Run create_tables.py to drop and create the tables:
python create_tables.py
- Run etl.py to process the files and load the data:
python etl.py
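After the load, a quick sanity check (illustrative, not part of the project scripts) confirms rows reached the fact table, reusing the cursor from the connection sketch above:

```python
# Illustrative check, reusing `cur` from the connection sketch above.
cur.execute("SELECT COUNT(*) FROM songplays;")
print("songplays rows:", cur.fetchone()[0])
```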