# User Guide (Scala)
In lemkit-model, model objects for prediction are held in the LinearClassifier class. This is a subclass of Classifier, which can be viewed as a function mapping data instances to predictions; hence, you can call it as a function to return a prediction, e.g. classifier(features). You can also retrieve the full set of predicted scores for a given data instance using scores(), which returns a sequence of tuples of (label, score). To retrieve the list of labels, use labels(). LinearClassifier objects can be read from binary or JSON-formatted files using (respectively) LinearClassifier.readBinaryModel() and LinearClassifier.readJSONModel().
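To make the function-like interface concrete, here is a minimal self-contained sketch of a classifier exposing `apply`, `scores`, and `labels`. This is an illustrative stand-in, not the actual `LinearClassifier` implementation; the class and weight layout are hypothetical:

```scala
// Illustrative stand-in for the Classifier interface: a linear model
// with a per-label weight vector over string features.
class ToyLinearClassifier(weights: Map[String, Map[String, Double]]) {
  // The set of labels the model knows about.
  def labels: Seq[String] = weights.keys.toSeq.sorted

  // Score every label for the given feature observations,
  // returning (label, score) tuples.
  def scores(features: Seq[(String, Double)]): Seq[(String, Double)] =
    labels.map { label =>
      val w = weights(label)
      label -> features.map { case (f, v) => w.getOrElse(f, 0.0) * v }.sum
    }

  // Calling the classifier as a function returns the best-scoring label.
  def apply(features: Seq[(String, Double)]): String =
    scores(features).maxBy(_._2)._1
}

val clf = new ToyLinearClassifier(Map(
  "positive" -> Map("good" -> 1.0, "bad" -> -1.0),
  "negative" -> Map("good" -> -1.0, "bad" -> 1.0)))
val prediction = clf(Seq("good" -> 2.0, "bad" -> 0.5))
// prediction == "positive"
```

The real `LinearClassifier` follows the same calling convention, with `scores()` and `labels()` available alongside the function-call syntax.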
Both labels and features are externally represented as strings, but internally converted to integer indices. Indexing of labels to integers is always exact, using a hash table (handled by the LabelMap class). Indexing of features to integers may be exact (using ExactFeatureMap) or may use feature hashing (using HashedFeatureMap).
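The exact indexing scheme can be sketched as follows. This is a hypothetical simplification, not the actual LabelMap/ExactFeatureMap code:

```scala
import scala.collection.mutable

// Assigns each distinct string a stable integer index on first sight.
class ExactIndexer {
  private val table = mutable.LinkedHashMap.empty[String, Int]

  // Return the existing index for s, or assign the next free one.
  def index(s: String): Int =
    table.getOrElseUpdate(s, table.size)

  // Recover the original strings, in index order.
  def items: Seq[String] = table.keys.toSeq
}

val idx = new ExactIndexer
val a = idx.index("positive")  // 0
val b = idx.index("negative")  // 1
val c = idx.index("positive")  // 0 again: exact indexing never collides
```

Because every distinct string gets its own slot, exact indexing grows with the vocabulary, whereas hashed indexing (discussed under `--hashtrick` below) stays within a fixed range.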
Features as passed to the functions of Classifier are of type FeatureSet[String], which is currently a type alias for Seq[FeatureObservation[String]]. Each FeatureObservation encapsulates a single feature and its value.
There is also a class Example that encapsulates a complete data instance, i.e. a set of features, a label, and an optional importance weight. Example is type-parameterized on the feature and label types. Currently the code uses only Example[String, String] (the externally-visible view of an Example, with features and labels represented as strings) and Example[Int, Int] (the internal representation of an Example, with features and labels indexed to integers).
Example and FeatureObservation are case classes, and can be used to directly create data instances, e.g.:
```scala
val feats = Seq("foo" -> 2.0, "bar" -> 3.0).map {
  case (feat, value) => FeatureObservation(feat, value)
}
val instance = Example(feats, "positive")
```
You can also read in a set of instances from a data file (in the format described in Input Format) using ClassifierSource:
```scala
val instances = ClassifierSource.readDataFile("data.predict")
```
The value returned by this function is an iterator; you may want to convert it to a sequence using toSeq.
The following code, also found in the Quick Start, shows how to read in a model and a set of instances and do prediction:
```scala
import com.peoplepattern.classify._
import com.peoplepattern.classify.data._

val classifier = LinearClassifier.readBinaryModel("model.binary")
val predictData = ClassifierSource.readDataFile("data.predict").toSeq
val predictions = predictData.map(i => classifier(i.features))
for ((prediction, inst) <- predictions zip predictData) {
  println(s"Predicted label: ${prediction}, correct label: ${inst.label}")
}
```
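Extending the loop above, overall accuracy is a simple comparison of predicted against correct labels. A minimal self-contained sketch, using hypothetical hard-coded label sequences in place of real model output:

```scala
// Hypothetical predicted and gold labels for a handful of instances.
val predictions = Seq("positive", "negative", "positive", "positive")
val goldLabels  = Seq("positive", "negative", "negative", "positive")

// Count matching pairs and divide by the total.
val numCorrect = predictions.zip(goldLabels).count { case (p, g) => p == g }
val accuracy = numCorrect.toDouble / goldLabels.size
println(f"Accuracy: ${accuracy * 100}%.2f%%")  // prints "Accuracy: 75.00%"
```

This is the same computation that `lkpredict --show-accuracy` (described below) performs for you.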
## lktrain

lktrain can be used to train and write out classification models, as well as to do prediction, using either a model trained in the same run or a previously written-out model (which may have been trained using either Scala or Python).
lktrain takes arguments as follows:
| Argument | Meaning |
|---|---|
| `--method`, `-m` | Training method: Vowpal Wabbit or LIBLINEAR |
| `--train`, `-t` | File containing training data |
| `--predict`, `-p` | File containing data to predict on |
| `--read-model`, `-r` | Read an existing model from the specified file |
| `--write-model`, `-w` | Write the trained model to the specified file |
| `--model-format`, `--mf` | Model file format: binary or JSON (default binary) |
| `--hashtrick` | Do feature hashing with specified # of features |
| `--l2` | Do L2 regularization as specified (default 1.0) |
| `--l1` | Do L1 regularization with specified feature multiple |
| `--verbose`, `-v` | Verbose output |
| `--default-options` | Control default options to training executable |
| `--extra-options` | Specify additional options to training executable |
Either --train (or -t) or --read-model (or -r) is required. If --train is specified, a new linear classifier will be trained using the data in the specified training file; otherwise, an existing model will be read in from the model file specified using --read-model.
Training is done using either Vowpal Wabbit or LIBLINEAR, according to the value of --method. This requires that either the Vowpal Wabbit executable vw or the LIBLINEAR executable train is available in your PATH.
If --write-model is specified, a model will be written out to the specified file.
If --predict is specified, prediction will be done using the data in the specified file and the model previously created or read in. Note that this partly duplicates functionality available in lkpredict (see below).
Various arguments are available for controlling how classifiers are trained:
For Vowpal and LIBLINEAR, L2 regularization is done by default, using a penalty value of 1.0; this can be changed using --l2. For Vowpal, to do feature selection with L1 regularization, use --l1 along with a feature multiple. This value controls the desired number of features, which is computed by multiplying the number of training examples by this value. A search will be done over the actual L1 penalty until the number of features produced is close to the desired number (specifically, between the desired number minus 1/4 of the number of examples and the desired number plus 1/2 of the number of examples). NOTE: Not currently implemented for LIBLINEAR.
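The acceptance window for the L1 search described above can be made concrete with a short sketch of the arithmetic (the function name and tuple return are illustrative, not the actual search code):

```scala
// Given the number of training examples and the --l1 feature multiple,
// compute the desired feature count and the window within which the
// search over the actual L1 penalty terminates.
def l1Window(numExamples: Int, featureMult: Double): (Double, Double, Double) = {
  val desired = numExamples * featureMult
  val lower = desired - numExamples / 4.0   // desired minus 1/4 of examples
  val upper = desired + numExamples / 2.0   // desired plus 1/2 of examples
  (desired, lower, upper)
}

val (desired, lower, upper) = l1Window(1000, 0.5)
// desired = 500.0, acceptable feature counts fall in [250.0, 1000.0]
```

So with 1000 training examples and a feature multiple of 0.5, the search accepts any penalty that yields between 250 and 1000 features.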
Additional options for Vowpal or LIBLINEAR can be specified using --default-options to override the default options, or --extra-options to specify additional options. Default options for Vowpal are --bfgs --passes 100 --loss_function logistic --holdout_off; this uses BFGS instead of SGD (stochastic gradient descent), with logistic loss and the early-stopping holdout mechanism disabled. Default options for LIBLINEAR are -s 7; this does dual L2-regularized logistic regression.
By default, a model is created using exact features, meaning that every distinct feature has its own parameter. Feature hashing (aka the "hash trick") can be done with --hashtrick, with a specified maximum number of features.
This maps each feature to a hash value within the specified range. This can potentially lead to dramatic improvements in speed and space efficiency. See the Wikipedia article on feature hashing.
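The core of feature hashing can be sketched as a simple modular hash (illustrative only; the actual HashedFeatureMap may use a different hash function):

```scala
// Map a feature string into the fixed range [0, numFeatures).
// Distinct features may collide on the same index, trading a little
// accuracy for bounded, predictable memory use.
def hashFeature(feature: String, numFeatures: Int): Int =
  Math.floorMod(feature.hashCode, numFeatures)  // floorMod keeps it non-negative

val numFeatures = 1 << 18  // e.g. a 262144-slot feature space
val i = hashFeature("word=hello", numFeatures)
// i is deterministic and always satisfies 0 <= i < numFeatures
```

The model then stores one parameter per hash slot rather than one per distinct feature, which is where the speed and space savings come from.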
The format of the model files can be either JSON or binary, according to the --model-format argument (defaults to binary).
The file passed to --train and --predict is as described in Input Format.
## lkpredict

lkpredict is used to do prediction on linear classifier models trained using lktrain. It takes arguments as follows:
| Argument | Meaning |
|---|---|
| `--model-format`, `--mf` | Format of the model file: binary or JSON |
| `--predict`, `-p` | File containing data to predict on |
| `--model`, `-m` | Model file to read |
| `--show-accuracy`, `-a` | Output a line at the end showing overall accuracy |
| `--show-correct`, `-c` | Add a column indicating whether each prediction was correct |
The arguments --predict (or -p) and --model (or -m) are required.
A basic invocation of lkpredict might be:
```shell
lkpredict -m vw.iris.exact.model.bin -p iris.data.test.txt
```
The file passed to --predict is as described in Input Format.
The file passed to --model should be a binary-format or JSON-format model file as created using lktrain.
The predictions are sent to stdout, normally formatted as follows:
```
1 Iris-setosa Iris-setosa
2 Iris-versicolor Iris-versicolor
...
26 Iris-versicolor Iris-virginica
...
```
Each line consists of a line number, then the correct label, then the predicted label.
If --show-correct (or -c) is used, a second column is added indicating whether the prediction was correct or wrong. If --show-accuracy (or -a) is used, a line at the end is output showing overall accuracy. For example, executing the following:

```shell
lkpredict --model vw.iris.exact.model.bin \
  --predict iris.data.test.txt --show-accuracy --show-correct
```

might produce output as follows:
```
1 CORRECT Iris-setosa Iris-setosa
2 CORRECT Iris-versicolor Iris-versicolor
...
26 WRONG Iris-versicolor Iris-virginica
...
30 CORRECT Iris-setosa Iris-setosa
Accuracy: 93.33%
```