Description
I have a dataset with about a million rows and 3 feature columns, each of a different data type: NumberOfFollowers is numerical, UserName is categorical, and Embeddings is of categorical-set type.
df:
Index  NumberOfFollowers  UserName  Embeddings     Target Variable
0      15                 name1     [0.5 0.3 0.2]  0
1      4                  name2     [0.4 0.2 0.4]  1
2      8                  name3     [0.5 0.5 0.0]  0
3      10                 name1     [0.1 0.0 0.9]  0
...    ...                ...       ...            ...
I would like to convert this data into the LibSVM input format.
Desired Output:
0 0:15 4:1 1:0.5 2:0.3 3:0.2
1 0:4 5:1 1:0.4 2:0.2 3:0.4
0 0:8 6:1 1:0.5 2:0.5 3:0.0
0 0:10 4:1 1:0.1 2:0.0 3:0.9
...
The Perl script https://github.com/srendle/libfm/blob/master/scripts/triple_format_to_libfm.pl handles categorical values. But how can a mixture of data types be handled, as described in this paper: https://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle_et_al2011-Context_Aware.pdf
Can this problem be solved using libFM itself, or do I have to use external tools? If external tools are needed, are you aware of any that can perform this conversion on very large data (I have many columns of mixed data types)?
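To make the index layout in the desired output concrete, here is a minimal Python sketch of the mapping I have in mind. It is only an illustration, not an existing libFM feature: the function name `to_libsvm_lines`, the `Row` tuple layout, and the `emb_dim` parameter are all hypothetical. It assumes index 0 holds NumberOfFollowers, indices 1 to 3 hold the embedding components, and each distinct UserName gets the next free index (starting at 4) in order of first appearance.

```python
from typing import Dict, Iterable, Iterator, Sequence, Tuple

# Hypothetical row layout: (NumberOfFollowers, UserName, Embeddings, Target)
Row = Tuple[float, str, Sequence[float], int]


def to_libsvm_lines(rows: Iterable[Row], emb_dim: int) -> Iterator[str]:
    """Yield one LibSVM-style line per row, streaming over the input."""
    user_index: Dict[str, int] = {}
    # Index 0 is the numerical column, 1..emb_dim the embedding components,
    # so one-hot indices for UserName start at emb_dim + 1.
    cat_offset = 1 + emb_dim
    for followers, username, embedding, target in rows:
        if len(embedding) != emb_dim:
            raise ValueError(f"expected {emb_dim} embedding values, got {len(embedding)}")
        # Assign the next free index the first time a username is seen.
        idx = user_index.setdefault(username, cat_offset + len(user_index))
        parts = [f"0:{followers}", f"{idx}:1"]
        parts += [f"{i}:{v}" for i, v in enumerate(embedding, start=1)]
        yield f"{target} " + " ".join(parts)


rows = [
    (15, "name1", [0.5, 0.3, 0.2], 0),
    (4, "name2", [0.4, 0.2, 0.4], 1),
    (8, "name3", [0.5, 0.5, 0.0], 0),
    (10, "name1", [0.1, 0.0, 0.9], 0),
]
for line in to_libsvm_lines(rows, emb_dim=3):
    print(line)
```

For the four sample rows this reproduces the desired output above. Assigning indices lazily only works here because UserName is the single categorical column; with several categorical columns each one would need its own reserved index range, which is the part I am unsure how to do at scale. Some LibSVM readers also expect indices in ascending order, in which case the parts would have to be sorted by index before joining.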