# -*- coding: utf-8 -*-
"""Copy of Task2.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1R8ZkC-YCyg1lFNAR0DIYo2O4ewNMzujJ

# Task 2

Run the following cell to mount your Google Drive.
"""

from google.colab import drive
drive.mount("/content/gdrive")

"""Run the following cell to import the necessary libraries"""

import pandas as pd
import matplotlib.pyplot as plt

"""## Task 2.1

Aspiring data scientists and machine learning enthusiasts need to understand their coding environment. Most of our programming is done in a Jupyter notebook. To see why that matters, let's perform a small exercise.

We are given two datasets containing data on students from two different branches and their respective divisions.

Download the datasets from the links given below:

Task2p1_0 - https://drive.google.com/file/d/1sCGCErnf2iC_qKUHcGLrt2O2HwuiwZYD/view?usp=sharing

Task2p1_1 - https://drive.google.com/file/d/1Kb_TcAEfsUGi-KPZpsh8pXBaxfs4AsHh/view?usp=sharing

Create a folder named 'Task2p1' in your Google Drive and upload the datasets 'Task2p1_0' and 'Task2p1_1' there.

### Python Script
"""

data = pd.read_csv("/content/gdrive/MyDrive/Task2p1/Task2p1_0.csv")
data

data = pd.read_csv("/content/gdrive/MyDrive/Task2p1/Task2p1_1.csv")
data

"""### Jupyter Notebook"""

data = pd.read_csv("/content/gdrive/MyDrive/Task2p1/Task2p1_0.csv")

data

data = pd.read_csv("/content/gdrive/MyDrive/Task2p1/Task2p1_1.csv")

data
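
"""A minimal sketch of why the same lines can behave differently in the two settings, using only the standard library. A script is compiled in `exec` mode, where a bare expression such as `data` is evaluated and its value is then discarded. An interactive session echoes the value of an expression entered on its own. The helper `run_and_capture` and the variable `value` are illustrative names that are not part of the task, and Python's `single` compile mode is used here as a stand-in for the notebook's display step. That is an assumption: Colab relies on IPython's richer display machinery, but the idea of echoing a trailing expression is the same.

```python
import contextlib
import io


def run_and_capture(source, mode, namespace):
    # Compile `source` in the given mode, run it, and return whatever reached stdout.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(compile(source, "<demo>", mode), namespace)
    return buffer.getvalue()


namespace = {"value": 6 * 7}

# Script behaviour ("exec" mode): the bare expression is evaluated, then dropped.
script_output = run_and_capture("value", "exec", namespace)

# Interactive behaviour ("single" mode): the repr of the expression is echoed.
interactive_output = run_and_capture("value", "single", namespace)

# In a script, the value has to be printed explicitly to become visible.
printed_output = run_and_capture("print(value)", "exec", namespace)
```

Under these assumptions `script_output` stays empty while the other two captures contain the value, which is why a script needs an explicit `print(data)` to show a dataframe.
"""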

"""Note the difference between the output you get when the code in cell 3 runs as a Python script and the output you get when you run it line by line in a Jupyter notebook.

*Double click here and write your answer in the editor that shows up*

## Task 2.2
"""

from google.colab import drive
drive.mount('/content/drive')

"""Download the dataset of fifty random Harry Potter spells from the link given below:

https://drive.google.com/file/d/1F2YEtVZaorL0WPGxOC5vXkMSixuYkI4v/view?usp=sharing

Create a folder named 'Task2hp' in your Google Drive and upload the dataset there.

Observe that the dataset is a .csv file. Using the pandas library, read the .csv file into a dataframe in the next cell.

Tip: You can use the Files tab in the sidebar on the left to navigate through your file structure.
"""

import pandas as pd
hp_df = pd.read_csv('/content/drive/MyDrive/Task2hp/HPSpells.csv')

"""Display the first five rows of the dataframe."""

hp_df.head()

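"""Remove all the rows whose Type is 'Hex'."""

# Inspect column dtypes and non-null counts before filtering.
hp_df.info()

# Look at the Type column to see which spell types are present.
hp_df.Type

# Drop every row whose Type is 'Hex', modifying hp_df in place.
hp_df.drop(hp_df[hp_df.Type == 'Hex'].index, inplace=True)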

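"""Now remove the rows that have NaN in the 'Light' column."""

# Count the missing values in 'Light' before dropping them.
hp_df.Light.isna().sum()

# Drop the rows where 'Light' is NaN, modifying hp_df in place.
hp_df.dropna(subset=['Light'], inplace=True)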
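
"""The two removal steps above can also be written without `inplace=True`, as a sketch on a tiny stand-in dataframe. Only the column names `Type` and `Light` come from the task; the `Name` column, the row values, and the variable names `spells`, `without_hex`, and `cleaned` are hypothetical and chosen purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for HPSpells.csv; the real file has different rows.
spells = pd.DataFrame(
    {
        "Name": ["Spell A", "Spell B", "Spell C", "Spell D"],
        "Type": ["Charm", "Hex", "Curse", "Jinx"],
        "Light": ["Blue", "Red", None, "Green"],
    }
)

# Boolean mask: keep every row whose Type is not 'Hex'.
without_hex = spells[spells["Type"] != "Hex"]

# Drop rows where Light is missing; NaN in other columns is left alone.
cleaned = without_hex.dropna(subset=["Light"])

# Reset the index so the row labels are contiguous again after the removals.
cleaned = cleaned.reset_index(drop=True)
```

Returning new dataframes keeps the original `spells` untouched, which makes it easier to re-run a cell without reloading the file.
"""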

"""Group the spells by their Type and draw a bar graph showing how many spells fall into each of the remaining types: Charm, Jinx, Transfiguration, Curse."""

import matplotlib.pyplot as plt

# Count the spells of each Type; value_counts() replaces a manual counting loop.
# sort=False skips sorting the types by frequency.
freq = hp_df.Type.value_counts(sort=False).to_dict()
names = list(freq.keys())

plt.figure(figsize=(20,7))
plt.bar(range(len(freq)),list(freq.values()))
plt.xticks(range(len(freq)),names)
plt.xlabel('Types of spells')
plt.ylabel('Number of spells')
plt.title('Spells Distribution')
plt.show()
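
"""The per-type counts plotted above can be cross-checked with the standard library, as a rough sketch. The list `types` below holds made-up values; in the notebook it would be something like `list(hp_df.Type)` after cleaning. The names `types`, `ordered`, `labels`, and `counts` are illustrative only.

```python
from collections import Counter

# Hypothetical Type values; the real ones come from the cleaned dataframe.
types = ["Charm", "Jinx", "Charm", "Curse", "Transfiguration", "Charm", "Jinx"]

# Counter tallies each distinct value in a single pass over the list.
freq = Counter(types)

# most_common() returns (type, count) pairs sorted by count, largest first.
ordered = freq.most_common()
labels = [name for name, _ in ordered]
counts = [count for _, count in ordered]
```

Passing `labels` and `counts` to `plt.bar` would draw the bars from the most to the least frequent type, which can make the distribution easier to read.
"""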