This repository contains simple example SQL scripts and CSV datasets to explore a small data warehouse-style project using SQL Server (mssql).
Contents
-
datasets/- sample CSV files (gold.dim_customers.csv, gold.dim_products.csv, gold.fact_sales.csv) -
scripts/- useful SQL scripts:initialize_database.sql— creates theDataWarehouseAnalyticsdatabase,goldschema, tables, and bulk-loads data from CSV files (this will DROP the database if it exists)database_exploration.sql— lists tables and columnsdate_exploration.sql— date range and age queriesdimentsions_exploration.sql— (typo: "dimentions") explore dimension tablesmagnitute_analysis.sql— aggregation and revenue queriesmeasures_exploration.sql— measures and summary report
-
ranking_analysis.sql— top/bottom ranking queries🧩 Dimensions vs. Measures
EDA starts by distinguishing between two types of data fields:
-
Dimensions – Qualitative or categorical fields that describe data. Examples: Category, Product, Region, Gender, Customer_ID
-
Measures – Quantitative or numerical fields that can be aggregated. Examples: Sales, Quantity, Profit, Age
➡️ Rule of thumb: If a column is numeric and makes sense to sum or average, it’s a measure. Otherwise, it’s a dimension.
📏 Dimensions Exploration
Dimension exploration helps identify unique values and distribution across categorical fields.
Common SQL operations:
-
SELECT DISTINCT column_name -
COUNT(*)/COUNT(DISTINCT column_name) -
GROUP BY column_namewithCOUNT(),AVG(),SUM() -
Percentiles / frequency distributions (e.g., COUNT() / total)
📆 Date Exploration
Dates are crucial for understanding time trends and activity periods.
Common SQL operations:
- MIN(order_date), MAX(order_date) — find dataset span
- DATEDIFF / DATEPART — compute durations or extract year/month/week
- GROUP BY YEAR(order_date), MONTH(order_date) — time series aggregations
- Rolling/window functions for moving averages (e.g., OVER(ORDER BY order_date ROWS BETWEEN ...))
📈 Measures Exploration
Measure exploration summarizes numerical columns using aggregations.
Common SQL operations:
- SUM(sales_amount), AVG(price), MIN(), MAX()
- COUNT(DISTINCT order_number) — orders vs items
- Distribution checks (histograms, percentiles) using NTILE or PERCENTILE_CONT
📊 Magnitude Analysis
Magnitude analysis connects measures with dimensions to show how metrics vary across categories.
Common SQL operations:
- GROUP BY category -> SUM(sales_amount) to see category revenue
- ORDER BY SUM(sales_amount) DESC to find largest contributors
- JOIN fact -> dimension tables to attribute measures to descriptive fields
🏆 Ranking Analysis
Ranking is used to find top or bottom performers in a dataset.
Common SQL operations:
- ROW_NUMBER(), RANK(), DENSE_RANK() with ORDER BY SUM(...) DESC/ASC
- Use a CTE/window function then filter by rank (e.g., WHERE rank <= 10)
-
For more advanced SQL warehouse exploration, see advanced-data-analytics.