| "Load, add label col and convert data into label and feature type dataframe (needed for MLlib)" | ||
| lines = ctx.textFile(fname) | ||
| parts = lines.map(lambda l: l.split(",")) |
There was a problem hiding this comment.
Kipras, Spark SQL has a built-in split function (pyspark.sql.functions.split) that should be much faster than the Python one. Try looking it up and checking a few examples of how it is used.
There was a problem hiding this comment.
We can use the Spark split function, but a few things need to change. We would have to start from a data file without headers. Later, however, when I want to join the 'dataset' column with the prediction results, I will need to know at least which column was the 'dataset' column.
I can do all of that, but we have to test whether it will actually be faster.
There is another option: use the Spark split function on the training data (which grows over the weeks and is much bigger, and where we don't need column names) and keep the old method for the validation data (which is roughly the same size every week), because the output file needs the dataset column.
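Code to use the SQL split function:
from pyspark.sql import Row
from pyspark.sql.functions import split
from pyspark.mllib.regression import LabeledPoint
lines = sc.textFile('dataFileWithoutHeader.csv')
row = Row("val")
df = lines.map(row).toDF()
tr = df.select(split(df.val, ","))
# The last field is the label and the rest are features; go through .rdd because newer DataFrames have no map()
def labelData(data):
    return data.rdd.map(lambda row: LabeledPoint(row[0][-1], row[0][:-1]))
trLabeled = labelData(tr)
training = sqlContext.createDataFrame(trLabeled, ['features','label'])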
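To make the header concern above concrete, here is a minimal plain-Python sketch (no Spark needed) of recording the position of the 'dataset' column from the header line before dropping it, so that headerless rows can still be matched back to predictions. The helper names, column names other than 'dataset', and the sample rows are hypothetical; this only illustrates the idea and is not the code used in this PR.

```python
def split_header(lines):
    """Return ({column name: index}, data lines) for a list of CSV text lines."""
    header, rows = lines[0], lines[1:]
    index = {name: i for i, name in enumerate(header.split(","))}
    return index, rows


def to_label_and_features(row):
    """Split one CSV row: the last field is the label, the rest are features."""
    fields = row.split(",")
    return float(fields[-1]), [float(v) for v in fields[:-1]]


# Hypothetical sample data: header plus two rows.
lines = ["dataset,naccess,nusers,target", "101,5,2,1", "102,7,3,0"]

index, rows = split_header(lines)
ds_col = index["dataset"]  # remember where 'dataset' lives before the header is gone

labeled = [to_label_and_features(r) for r in rows]
dataset_ids = [r.split(",")[ds_col] for r in rows]  # kept for joining with predictions
```

With the index saved this way, the headerless file could go through the Spark split path while the 'dataset' position is still known for the output join. Whether that is worth the extra step depends on the timing test mentioned above.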
Kipras, feel free to contribute more to this PR; I'll keep it open until I am back. You may want to check a few things I pointed out in the comments on the code lines.
| classifiers=("AdaBoostClassifier" "BaggingClassifier" "DecisionTreeClassifier" "ExtraTreesClassifier" "GradientBoostingClassifier" "XGBClassifier" "KNeighborsClassifier" "RandomForestClassifier" "RidgeClassifier" "SGDClassifier") | ||
| #classifiers=("KNeighborsClassifier" "RidgeClassifier" "BaggingClassifier") | ||
| drops="nusers,totcpu,rnaccess,rnusers,rtotcpu,nsites,s_0,s_1,s_2,s_3,s_4,wct" | ||
| drops="campain,creation_date,tier_name,dataset_access_type,dataset_id,energy,flown_with,idataset,last_modification_date,last_modified_by,mcmevts,mcmpid,mcmtype,nseq,pdataset,physics_group_name,prep_id,primary_ds_name,primary_ds_type,processed_ds_name,processing_version,pwg,this_dataset,rnaccess,rnusers,rtotcpu,s_0,s_1,s_2,s_3,s_4,totcpu,wct,cpu,xtcrosssection" |
There was a problem hiding this comment.
Kipras, please revert this, since I already committed the change to master. You'll get a conflict here.
This reverts commit 4ccae63.