benwing edited this page Jan 4, 2015 · 4 revisions

Scala Quickstart

Lemkit comes in two packages, lemkit-train and lemkit-model. lemkit-train creates and saves linear classification models, which can then be read using lemkit-model. There are command-line applications for both training and prediction, and both can also be done using library functions.

Library functions

lemkit-train

This library depends on lemkit-model. See the documentation for that package for an introduction to feature observations and examples.

com.peoplepattern.classify.ClassifyApp implements lktrain and is a good source of example code.

Imagine you have a data file called data.train containing data instances in the format described in Input Format. Then use the following code. Note that in order for this to work, the vw executable that runs Vowpal Wabbit needs to be installed somewhere in your PATH.

  import com.peoplepattern.classify._
  import com.peoplepattern.classify.data._

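  // Default hashing options and Vowpal Wabbit training options
  val hashOptions = HashingOptions()
  val options = VowpalClassifierOptions(hashOptions)

  // Read the training instances, train (requires vw on your PATH), and save the model
  val trainData = ClassifierSource.readDataFile("data.train").toSeq
  val classifier = VowpalClassifier.train(trainData, options)
  LinearClassifier.writeBinaryModel(classifier, "model.binary")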

You can set options for hashing using HashingOptions, and options for training with Vowpal Wabbit using VowpalClassifierOptions. For example, to turn on feature hashing with a maximum of 1000 features, use:

  val hashOptions = HashingOptions(hashtrick = Some(1000))
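To make the idea behind feature hashing concrete, here is a minimal, self-contained sketch of the hashing trick in plain Scala. It is not Lemkit's implementation: the object HashingSketch, its methods, the choice of MurmurHash3, and the decision to sum colliding values are all assumptions made for illustration. It only shows how an unbounded set of feature names can be folded into a fixed number of buckets (1000 in the example above).

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical helper, not part of Lemkit: illustrates the hashing trick only.
object HashingSketch {
  /** Map a feature name to a bucket index in [0, maxFeatures). */
  def bucket(feature: String, maxFeatures: Int): Int = {
    val h = MurmurHash3.stringHash(feature)
    // floorMod keeps the index non-negative even when the hash is negative
    java.lang.Math.floorMod(h, maxFeatures)
  }

  /** Collapse named feature values into a sparse vector indexed by bucket.
    * Features that collide in the same bucket have their values summed
    * (an assumption of this sketch). */
  def hashFeatures(features: Seq[(String, Double)], maxFeatures: Int): Map[Int, Double] =
    features
      .groupBy { case (name, _) => bucket(name, maxFeatures) }
      .map { case (idx, group) => idx -> group.map(_._2).sum }
}
```

Calling HashingSketch.hashFeatures(Seq("word=free" -> 1.0, "word=offer" -> 1.0), 1000) yields a map with at most 1000 distinct keys no matter how many different feature names appear in the data. The trade-off is that unrelated features can share a bucket.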

You can test this code in a Scala REPL (read-eval-print loop) by running sbt console and then pasting the code in.
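Since the training code above only works when vw is on your PATH, it can be handy to check for it before training. The following is a small sketch in plain Scala. VwCheck and findOnPath are hypothetical names, not part of Lemkit, and the check assumes a Unix-like system where the executable is named exactly vw.

```scala
import java.io.File

// Hypothetical helper, not part of Lemkit.
object VwCheck {
  /** Return the first executable file called `name` found in the directories
    * of the given PATH string (defaults to the PATH environment variable). */
  def findOnPath(name: String, path: String = sys.env.getOrElse("PATH", "")): Option[File] =
    path.split(File.pathSeparator).toSeq
      .filter(_.nonEmpty)                        // skip empty PATH entries
      .map(dir => new File(dir, name))
      .find(f => f.isFile && f.canExecute)
}
```

VwCheck.findOnPath("vw") returns Some(file) when an executable is found and None otherwise, so you can print a clear message instead of waiting for training to fail.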

lemkit-model

Assuming you have already trained a model using lemkit-train, you can use it for prediction as follows. See also com.peoplepattern.classify.PredictApp, which implements bin/lkpredict and is a good source of example code.

Imagine you have saved out a binary model to a file called model.binary, and have a data file called data.predict containing data instances in the format described in Input Format. Then use code like this:

  import com.peoplepattern.classify._
  import com.peoplepattern.classify.data._

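  // Load the saved model and the instances to classify
  val classifier = LinearClassifier.readBinaryModel("model.binary")
  val predictData = ClassifierSource.readDataFile("data.predict").toSeq
  // Calling the classifier on an instance's features returns the predicted label
  val predictions = predictData.map(i => classifier(i.features))
  // Print each prediction next to the label given in the data file
  for ((prediction, inst) <- predictions zip predictData) {
    println(s"Predicted label: ${prediction}, correct label: ${inst.label}")
  }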

If you want the scores for each possible label, use classifier.scores in place of just classifier (which calls the apply function).
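The relationship between per-label scores and a single predicted label can be sketched with a toy linear classifier. This is not Lemkit's LinearClassifier: the class ToyLinearClassifier, its field names, and its use of string-keyed maps are assumptions for illustration. It only shows the usual pattern where scores returns one value per label and apply picks the label with the highest score.

```scala
// Hypothetical stand-in for a linear classifier; names and types are illustrative only.
final case class ToyLinearClassifier(weights: Map[String, Map[String, Double]]) {
  /** One score per label: the dot product of that label's weights with the features. */
  def scores(features: Map[String, Double]): Map[String, Double] =
    weights.map { case (label, w) =>
      label -> features.iterator.map { case (f, v) => w.getOrElse(f, 0.0) * v }.sum
    }

  /** The predicted label is the one with the highest score. */
  def apply(features: Map[String, Double]): String =
    scores(features).maxBy(_._2)._1
}
```

For example, with weights Map("spam" -> Map("free" -> 2.0, "meeting" -> -1.0), "ham" -> Map("free" -> -1.0, "meeting" -> 1.5)) and features Map("free" -> 1.0, "meeting" -> 1.0), scores gives 1.0 for spam and 0.5 for ham, so calling the classifier directly returns "spam".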

You can test this code in a Scala REPL (read-eval-print loop) by running sbt console and then pasting the code in.

Command-line applications

lktrain

In lemkit-train, the command-line app lktrain can be used to train and write out classification models. It can also do prediction, using either a model trained at the same time or a previously written-out model, whether that model was trained using Scala or Python. The following command line does the equivalent of the library code above for lemkit-train:

  lktrain -m vowpal -f binary -t data.train -w model.binary

Full documentation is available in the User Guide.

lkpredict

lkpredict is used to do prediction on linear classifier models trained using lemkit-train. The following command line does the equivalent of the library code above for lemkit-model:

  lkpredict -m vowpal -f binary -p data.predict

Full documentation is available in the User Guide.
