Spark code and results#28

Open
KiprasKancys wants to merge 4 commits into dmwm:master from KiprasKancys:master

@KiprasKancys

No description provided.

"Load, add label col and convert data into label and feature type dataframe (needed for MLlib)"

lines = ctx.textFile(fname)
parts = lines.map(lambda l: l.split(","))

Kipras, Spark SQL has a built-in split function (pyspark.sql.functions.split) which should be much faster than the Python one. Please look up a few examples of how it is used.


We can use the Spark split function, but a few things have to change. We would need to start from a data file without a header. Later on, however, I want to join the 'dataset' column with the prediction results, so I will need to know at least which column was the 'dataset' column.
I can do all of that, but we have to test whether it is actually faster.
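
One way to handle the header is sketched below, assuming the header is the first line of the file and the column is literally named `dataset`: record the column's position from the header, then filter the header out before splitting. The helper names `column_index` and `strip_header` are hypothetical, and `strip_header` relies only on the RDD methods `first()` and `filter()`.

```python
def column_index(header_line, name="dataset", sep=","):
    """Return the position of column `name` in a CSV header line."""
    return header_line.split(sep).index(name)


def strip_header(lines):
    """Split an RDD of text lines into (header, body without the header).

    Assumes the header is the first line; any later line identical to the
    header is dropped as well.
    """
    header = lines.first()
    body = lines.filter(lambda line: line != header)
    return header, body


# Hypothetical usage with a SparkContext `sc`:
#   header, body = strip_header(sc.textFile(fname))
#   dataset_col = column_index(header)
```

With the index saved, the header-free `body` can go through the Spark split function while the position of the 'dataset' column stays available for the later join.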

There is another option: use the Spark split function on the training data (which grows over the weeks, is much bigger, and does not need column names) and keep the old method for the validation data (which is more or less the same size every week), because the output file needs the dataset column.

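Code to use the SQL split function:
from pyspark.sql import Row
from pyspark.sql.functions import split
from pyspark.mllib.regression import LabeledPoint

lines = sc.textFile('dataFileWithoutHeader.csv')
row = Row("val")
df = lines.map(row).toDF()
# split each CSV line into a single array column
tr = df.select(split(df.val, ","))

# the last field of each row is the label, all fields before it are features
def labelData(data):
    # go through .rdd so this also works where DataFrame has no map()
    return data.rdd.map(lambda row: LabeledPoint(row[0][-1], row[0][:-1]))
trLabeled = labelData(tr)
training = sqlContext.createDataFrame(trLabeled, ['features','label'])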
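
For reference, the row convention that `labelData` relies on (the last field is the label, everything before it is a feature) can be checked on a single split line in plain Python. This is only a sketch: `to_label_and_features` is a hypothetical helper and the sample values are made up.

```python
def to_label_and_features(fields):
    """Convert one split CSV row into (label, features), all as floats."""
    label = float(fields[-1])
    features = [float(x) for x in fields[:-1]]
    return label, features


# Made-up sample line: three features followed by the label.
label, features = to_label_and_features("0.5,2,7,1".split(","))
```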

@vkuznet

vkuznet commented Jul 28, 2016

Kipras, feel free to contribute more to this PR; I'll keep it open until I'm back. You may want to check the few things I pointed out alongside the code lines.

classifiers=("AdaBoostClassifier" "BaggingClassifier" "DecisionTreeClassifier" "ExtraTreesClassifier" "GradientBoostingClassifier" "XGBClassifier" "KNeighborsClassifier" "RandomForestClassifier" "RidgeClassifier" "SGDClassifier")
#classifiers=("KNeighborsClassifier" "RidgeClassifier" "BaggingClassifier")
drops="nusers,totcpu,rnaccess,rnusers,rtotcpu,nsites,s_0,s_1,s_2,s_3,s_4,wct"
drops="campain,creation_date,tier_name,dataset_access_type,dataset_id,energy,flown_with,idataset,last_modification_date,last_modified_by,mcmevts,mcmpid,mcmtype,nseq,pdataset,physics_group_name,prep_id,primary_ds_name,primary_ds_type,processed_ds_name,processing_version,pwg,this_dataset,rnaccess,rnusers,rtotcpu,s_0,s_1,s_2,s_3,s_4,totcpu,wct,cpu,xtcrosssection"

Kipras, please revert this, since I have already committed the change to master. Otherwise you'll get a conflict here.

This reverts commit 4ccae63.