This project aims to identify the relationship between weather (specifically precipitation and snow levels) and Nasdaq trading volumes and closing prices.
To run the pipeline, clone this repo to your local machine and download the 3 data files here. Then run bash run_script.sh in your local terminal to execute the entire pipeline in one command.
The Tableau visualization can be accessed through this link.
We pre-processed our data before loading it into our pipeline. Using the spark_preprocessing.ipynb notebook, we filtered the data to include only data points from the year 2016, which produced two datasets in Parquet format (weather.snappy.parquet and nasdaq.snappy.parquet). A third dataset was added later.
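The year filter described above can be illustrated with a minimal sketch. It uses plain Python rather than Spark so that it runs without extra installs; the function name `filter_to_year`, the `date_key` parameter, and the sample rows are all hypothetical and are not taken from spark_preprocessing.ipynb.

```python
from datetime import date


def filter_to_year(rows, year, date_key="date"):
    """Keep only the rows whose date falls in the given calendar year."""
    return [row for row in rows if row[date_key].year == year]


# Hypothetical sample rows; the real datasets have many more columns.
sample = [
    {"ticker": "AAA", "date": date(2015, 12, 31), "volume": 100},
    {"ticker": "AAA", "date": date(2016, 1, 4), "volume": 200},
    {"ticker": "AAA", "date": date(2017, 1, 3), "volume": 300},
]

# Only the 2016 row survives the filter.
rows_2016 = filter_to_year(sample, 2016)
```

In Spark the same idea would typically be expressed as a filter on the year of the date column, but the exact expression used in the notebook is not shown here.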
You can skip the pre-processing step and directly download the data files through this link.
The original unprocessed data is available further down.
The pyspark_script.py file is used to run the data manipulation code, which includes some column removals, transformations, and joins.
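As a rough illustration of a column removal combined with a join, the sketch below inner-joins stock rows to weather rows in plain Python. It assumes the two datasets are matched on their date columns and that the weather rows come from a single station; neither assumption is confirmed by pyspark_script.py, and the helper `join_on_date` and the sample values are hypothetical.

```python
def join_on_date(stock_rows, weather_rows, drop_columns=()):
    """Inner-join stock rows to weather rows on the date, then drop columns."""
    # Index weather rows by date (assumes one weather row per date).
    weather_by_date = {w["DATE"]: w for w in weather_rows}
    joined = []
    for stock in stock_rows:
        weather = weather_by_date.get(stock["date"])
        if weather is None:
            continue  # no weather reading for that trading day
        merged = dict(stock)
        # Copy weather fields, skipping the duplicate date key.
        merged.update({k: v for k, v in weather.items() if k != "DATE"})
        for column in drop_columns:
            merged.pop(column, None)
        joined.append(merged)
    return joined


# Hypothetical sample data using column names from the datasets described here.
stocks = [
    {"ticker": "AAA", "date": "2016-01-04", "close": 10.0, "split_ratio": 1.0},
    {"ticker": "AAA", "date": "2016-01-05", "close": 10.5, "split_ratio": 1.0},
]
weather = [{"DATE": "2016-01-04", "PRCP": 12, "SNOW": 0}]

joined = join_on_date(stocks, weather, drop_columns=("split_ratio",))
```

Only the 2016-01-04 row appears in `joined`, because the second trading day has no matching weather row.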
The queries_script.py file is used to run the data aggregation code, which includes DuckDB queries that output 6 CSV files to be used for our Tableau visualizations.
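The shape of one such aggregation can be sketched as follows. This is not one of the six actual queries: the table name `joined`, the snow-day grouping, and the sample values are assumptions made for illustration. The standard-library sqlite3 module is swapped in for DuckDB so the sketch runs without extra installs, and the CSV is written to an in-memory buffer instead of a file.

```python
import csv
import io
import sqlite3

# In-memory database standing in for the joined weather/stock table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE joined (date TEXT, volume INTEGER, SNOW REAL)")
conn.executemany(
    "INSERT INTO joined VALUES (?, ?, ?)",
    [
        ("2016-01-04", 500, 0.0),
        ("2016-01-05", 300, 25.0),
        ("2016-01-06", 700, 0.0),
    ],
)

# Hypothetical aggregation: average trading volume on snow vs. no-snow days.
query = """
    SELECT CASE WHEN SNOW > 0 THEN 'snow' ELSE 'no_snow' END AS snow_day,
           AVG(volume) AS avg_volume
    FROM joined
    GROUP BY snow_day
    ORDER BY snow_day
"""
cursor = conn.execute(query)
header = [column[0] for column in cursor.description]
result_rows = cursor.fetchall()

# Serialize the result as CSV text, ready for a visualization tool.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(result_rows)
csv_text = buffer.getvalue()
conn.close()
```

Each query in the real script would produce one such CSV, which can then be loaded into Tableau as a data source.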
Once all the required data files are downloaded and available, you can skip running Steps 1 & 2 manually and instead run the run_script.sh file with the following command on your local command line:
bash run_script.sh
Step 3 will give you 6 CSV files as outputs. These files were visualized to draw our conclusions.
We used Tableau Public for our visualizations. Our dashboard can be accessed through this link.
You can find the original sources of our data at the links below.
Global Daily Weather Data
Same columns as in weather.snappy.parquet.
GHCN_DIN: Global Historical Climatology Network Daily Identification Number
DATE (year-month-day)
PRCP: precipitation (tenths of mm)
SNOW: snowfall (mm)
TMAX: daily maximum temperature (°C)
TMIN: daily minimum temperature (°C)
NAME: weather station name
ELEVATION: elevation (meters)
COUNTRY_CODE: two-letter country code
COORD: latitude and longitude of the station, in decimal degrees
Nasdaq data
To download this, you will need to sign up for an API key.
Same columns as in nasdaq.snappy.parquet.
ticker: stock ticker
date (year-month-day)
open: first price at which the stock traded on that day
high: highest price that the stock reached on that day
low: lowest price that the stock reached on that day
close: last price at which the stock traded on that day
volume: total number of shares traded on that day
ex-dividend: cash dividend per share paid by the company on that day, adjusted for stock splits
split_ratio: ratio of a stock split that occurred on that day
adj_open: open, adjusted for stock splits and dividends
adj_high: high, adjusted for stock splits and dividends
adj_low: low, adjusted for stock splits and dividends
adj_close: close, adjusted for stock splits and dividends
adj_volume: volume, adjusted for stock splits
Same columns as in nasdaq_industries.csv.
Symbol: stock ticker
Name: company name
Last Sale: most recent price at which the company's stock was traded
Net Change: difference between the last sale price and the previous day's closing price
% Change: percentage change in the stock's price compared to the previous day's closing price
Market Cap: total market value of a company's outstanding shares
Country: country where the company is headquartered
IPO Year: year the company first offered its shares to the public through an Initial Public Offering
Sector: company sector
Industry: company industry
This project was made by:
Alyssa Fontaine
Eliz Zhou
Jane Lee
Joshua Bastin
Megan Bennett
Yash Laddha