This project performs Exploratory Data Analysis (EDA) on a car dataset: we clean, process, and visualize the data to uncover useful insights. The steps followed in this analysis are outlined below:
We begin by importing all the essential libraries needed for EDA, such as pandas, numpy, and matplotlib.
The dataset is loaded into a pandas DataFrame. We then display the first 5 and last 5 rows to get an overview of the data.
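As a minimal sketch of the loading step, the snippet below reads CSV data into a DataFrame and previews both ends of it. The inline `SAMPLE_CSV` text and its rows are made-up stand-ins so the example runs on its own; with the real data you would pass the file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file (rows are illustrative only).
# With the actual dataset: df = pd.read_csv("path/to/your_file.csv")
SAMPLE_CSV = io.StringIO(
    "Make,Year,Engine HP,MSRP\n"
    "A,2010,150,20000\n"
    "B,2011,200,25000\n"
    "C,2012,250,30000\n"
    "D,2013,300,35000\n"
    "E,2014,350,40000\n"
    "F,2015,400,45000\n"
)

df = pd.read_csv(SAMPLE_CSV)

print(df.head(5))  # first 5 rows
print(df.tail(5))  # last 5 rows
```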
We check the data types of each column in the dataset to ensure correct interpretation for analysis.
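One possible way to run this check is shown below. The tiny DataFrame is invented for illustration, and the conversion with `pd.to_numeric` is an assumption about what a fix might look like when a numeric column has been read as text; the original analysis may not have needed it.

```python
import pandas as pd

# Illustrative data: "MSRP" is deliberately stored as text.
df = pd.DataFrame(
    {
        "Make": ["A", "B"],
        "Year": [2010, 2015],
        "MSRP": ["20000", "35000"],
    }
)

print(df.dtypes)  # one dtype per column

# Convert a text column to numbers; unparseable values become NaN.
df["MSRP"] = pd.to_numeric(df["MSRP"], errors="coerce")
print(df.dtypes)
```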
Irrelevant columns like Engine Fuel Type, Market Category, Vehicle Style, Popularity, Number of Doors, and Vehicle Size are dropped as they are not required for analysis.
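The drop step could look roughly like this. The column names come from the list above; the one-row DataFrame is a made-up example, and `errors="ignore"` is an optional choice so the call does not fail if a listed column is absent.

```python
import pandas as pd

COLUMNS_TO_DROP = [
    "Engine Fuel Type",
    "Market Category",
    "Vehicle Style",
    "Popularity",
    "Number of Doors",
    "Vehicle Size",
]

# Illustrative frame containing only some of the columns to drop.
df = pd.DataFrame(
    {
        "Make": ["A"],
        "MSRP": [20000],
        "Popularity": [100],
        "Vehicle Size": ["Compact"],
    }
)

# errors="ignore" skips names that are not present in the frame.
df = df.drop(columns=COLUMNS_TO_DROP, errors="ignore")
print(df.columns.tolist())
```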
To simplify the column names, we rename certain columns:
- Engine HP → HP
- Engine Cylinders → Cylinders
- Transmission Type → Transmission
- Driven_Wheels → Drive Mode
- highway MPG → MPG-H
- city mpg → MPG-C
- MSRP → Price
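The renames listed above can be expressed as a single mapping passed to `DataFrame.rename`. This is a sketch on an empty placeholder frame; the variable name `RENAME_MAP` and the extra `Make` column are illustrative.

```python
import pandas as pd

RENAME_MAP = {
    "Engine HP": "HP",
    "Engine Cylinders": "Cylinders",
    "Transmission Type": "Transmission",
    "Driven_Wheels": "Drive Mode",
    "highway MPG": "MPG-H",
    "city mpg": "MPG-C",
    "MSRP": "Price",
}

# Empty placeholder frame that carries the original column names.
df = pd.DataFrame(columns=["Make", *RENAME_MAP])

# Columns not mentioned in the mapping (here "Make") are left unchanged.
df = df.rename(columns=RENAME_MAP)
print(df.columns.tolist())
```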
We display the original shape of the dataset, check for duplicate rows, and then drop them to ensure data integrity.
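A small sketch of the duplicate check, using invented rows in which one row is repeated. Resetting the index afterwards is an optional tidy-up, not something the steps above require.

```python
import pandas as pd

# Illustrative data: the first two rows are identical.
df = pd.DataFrame(
    {
        "Make": ["A", "A", "B"],
        "Price": [20000, 20000, 35000],
    }
)

print("Original shape:", df.shape)

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(df.duplicated().sum())
print("Duplicate rows:", n_duplicates)

df = df.drop_duplicates().reset_index(drop=True)
print("Shape after dropping duplicates:", df.shape)
```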
We calculate and display summary statistics for all numerical columns, including count, mean, standard deviation, minimum, percentiles, and maximum values.
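Assuming the statistics come from `DataFrame.describe`, a sketch looks like the following. On a frame with mixed column types, `describe()` summarizes only the numeric columns and reports count, mean, standard deviation, minimum, quartiles, and maximum; column totals, if wanted, are available separately through `df.sum(numeric_only=True)`. The data is made up.

```python
import pandas as pd

# Illustrative data with two numeric columns and one text column.
df = pd.DataFrame(
    {
        "HP": [100, 200, 300],
        "Price": [10000, 20000, 30000],
        "Make": ["A", "B", "C"],
    }
)

stats = df.describe()  # numeric columns only by default
print(stats)

totals = df.sum(numeric_only=True)  # per-column totals, if needed
print(totals)
```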
We count the missing (null) values in each column, drop the rows that contain missing data, and then repeat the count to confirm that none remain.
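The missing-value step might be sketched as below, on invented data with two incomplete rows. Note that `dropna()` with default arguments removes a row if any of its columns is missing.

```python
import pandas as pd

# Illustrative data: rows 1 and 2 each have one missing value.
df = pd.DataFrame(
    {
        "HP": [100.0, None, 300.0],
        "Cylinders": [4.0, 6.0, None],
        "Price": [10000, 20000, 30000],
    }
)

missing_before = df.isnull().sum()  # missing count per column
print(missing_before)

df = df.dropna().reset_index(drop=True)

missing_after = df.isnull().sum()  # should be all zeros
print(missing_after)
```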
Various plots are created to analyze the dataset:
- Horsepower (HP) vs Price
- Sales by Year
- Number of Cars in Each Year
- Preferred Drive Mode (most common Drive Mode)
- Highway MPG (MPG-H) vs City MPG (MPG-C)
- Transmission Type vs MPG-H and MPG-C
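As a rough illustration, four of the plots listed above can be drawn with matplotlib and the pandas plotting helpers. The rows, the category values (for example "front wheel drive"), and the 2x2 layout are assumptions made for this sketch; the original analysis may have used different chart types.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data using the renamed columns; values are made up.
df = pd.DataFrame(
    {
        "Year": [2010, 2010, 2015, 2016],
        "HP": [150, 200, 300, 250],
        "Price": [20000, 25000, 45000, 38000],
        "Drive Mode": [
            "front wheel drive",
            "front wheel drive",
            "rear wheel drive",
            "all wheel drive",
        ],
        "Transmission": ["MANUAL", "AUTOMATIC", "AUTOMATIC", "MANUAL"],
        "MPG-H": [30, 28, 24, 26],
        "MPG-C": [22, 20, 16, 18],
    }
)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Horsepower vs Price
axes[0, 0].scatter(df["HP"], df["Price"])
axes[0, 0].set_xlabel("HP")
axes[0, 0].set_ylabel("Price")
axes[0, 0].set_title("HP vs Price")

# Number of cars in each year
cars_per_year = df["Year"].value_counts().sort_index()
cars_per_year.plot(kind="bar", ax=axes[0, 1], title="Cars per Year")

# Most common drive mode
drive_counts = df["Drive Mode"].value_counts()
drive_counts.plot(kind="bar", ax=axes[1, 0], title="Drive Mode")

# Average highway and city MPG per transmission type
mpg_by_transmission = df.groupby("Transmission")[["MPG-H", "MPG-C"]].mean()
mpg_by_transmission.plot(kind="bar", ax=axes[1, 1], title="MPG by Transmission")

fig.tight_layout()
```

In a script, call `plt.show()` afterwards to display the figure; in a notebook it is usually rendered automatically.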
We filter the numeric columns, calculate the correlation matrix, and plot a heatmap to visualize the relationships between variables.
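The correlation step can be sketched with plain matplotlib, as below. Whether the original heatmap used matplotlib or another plotting library is not stated, so treat the drawing code as one option; the data is invented so that the correlations are easy to verify by eye.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data: Price rises with HP, MPG-C falls with HP.
df = pd.DataFrame(
    {
        "Make": ["A", "B", "C", "D"],
        "HP": [100, 200, 300, 400],
        "Price": [10000, 20000, 30000, 40000],
        "MPG-C": [30, 25, 20, 15],
    }
)

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric_df = df.select_dtypes(include="number")
corr = numeric_df.corr()

fig, ax = plt.subplots(figsize=(6, 5))
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)

# Label both axes with the column names.
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Write each coefficient inside its cell.
for i in range(len(corr.index)):
    for j in range(len(corr.columns)):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(image, ax=ax)
fig.tight_layout()
```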
Additional insights are explored, with creative and experimental analysis to uncover further patterns in the data.
Feel free to explore the dataset and the analysis further to gain deeper insights.