Convert Data with mixed datatypes to LibSVM format #43

@sumitsidana

I have data with about a million rows and 3 feature columns, each of a different data type: NumberOfFollowers is numerical, UserName is categorical, and Embeddings is of categorical-set type.

df:

Index  NumberOfFollowers  UserName  Embeddings     Target Variable
0      15                 name1     [0.5 0.3 0.2]  0
1      4                  name2     [0.4 0.2 0.4]  1
2      8                  name3     [0.5 0.5 0.0]  0
3      10                 name1     [0.1 0.0 0.9]  0
...    ...                ...       ...            ...

I would like to convert this data into the LibSVM input format.

Desired Output:

0 0:15 4:1 1:0.5 2:0.3 3:0.2
1 0:4 5:1 1:0.4 2:0.2 3:0.4
0 0:8 6:1 1:0.5 2:0.5 3:0.0
0 0:10 4:1 1:0.1 2:0.0 3:0.9
...
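
To make the intended mapping concrete, here is a minimal sketch in plain Python of how the desired output above could be produced. It is an illustration under my own assumptions, not something libfm provides: the function name `to_libsvm_lines` and the index layout (index 0 for NumberOfFollowers, indices 1 to 3 for the embedding values, and one new index per distinct UserName starting at 4) are inferred from the example rows.

```python
def to_libsvm_lines(rows, embedding_dim):
    """Yield one LibSVM-style line per row.

    rows: iterable of (num_followers, user_name, embedding, target).
    Assumed layout: 0 = numeric column, 1..embedding_dim = embedding values,
    embedding_dim + 1 onward = one indicator index per distinct user name.
    """
    user_index = {}                       # user name -> feature index
    first_user_index = 1 + embedding_dim  # 4 when embedding_dim == 3
    for followers, user, embedding, target in rows:
        # Assign indices to user names in order of first appearance.
        if user not in user_index:
            user_index[user] = first_user_index + len(user_index)
        features = [f"0:{followers}", f"{user_index[user]}:1"]
        features += [f"{i}:{v}" for i, v in enumerate(embedding, start=1)]
        yield f"{target} " + " ".join(features)


# The four example rows from the table above.
rows = [
    (15, "name1", [0.5, 0.3, 0.2], 0),
    (4, "name2", [0.4, 0.2, 0.4], 1),
    (8, "name3", [0.5, 0.5, 0.0], 0),
    (10, "name1", [0.1, 0.0, 0.9], 0),
]

for line in to_libsvm_lines(rows, embedding_dim=3):
    print(line)
```

The generator handles one row at a time, so only the UserName-to-index dictionary has to stay in memory. The features are emitted in the same order as in my desired output, which is not sorted by index; a reader that expects ascending indices would need the features sorted before joining.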

The Perl script https://github.com/srendle/libfm/blob/master/scripts/triple_format_to_libfm.pl handles categorical values. But how can a mixture of data types be handled, as described in this paper: https://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle_et_al2011-Context_Aware.pdf

Can this problem be solved using libfm, or do I have to use external tools? If external tools are needed, are you aware of any that perform this conversion on very large-scale data (I have many columns of mixed data types)?
