TrainClassifier tutorial
TrainClassifier tells Minorthird to learn a classifier for labeling entire documents (for example, flagging email as spam) from labeled training data. This type of experiment needs a directory of labeled documents. The example uses sample3.train, which is built into the code and requires no additional setup. To see how to label and load your own data for this task, look at the Labeling and Loading Data Tutorial. The output of this task is a text classifier annotator, which can be tested on more labeled documents using TestClassifier or used to add labels to documents using ApplyAnnotator.
To run this type of task start with:
$ java -Xmx500M edu.cmu.minorthird.ui.TrainClassifier
Like all UI tasks, all the parameters for TrainClassifier may be specified using either the GUI or the command line. To use the GUI, simply add -gui to the command line. It is also possible to mix and match where the parameters are specified. For example, one can specify two parameters on the command line and use the GUI to select the rest. For this reason, each step of this experiment first explains how to select a parameter value in the GUI and then how to set the same parameter on the command line.
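As a rough illustration of mixing the two, the sketch below sets two parameters on the command line and leaves the rest to the GUI. It uses the -labels and -spanType options with the values this tutorial chooses for them in its parameter steps. It assumes Minorthird is already on the Java classpath, and the variable name MIXED_CMD is only for illustration.

```shell
# Sketch only: assumes Minorthird is on the Java classpath.
# MIXED_CMD is a hypothetical variable name used for illustration.
# -labels and -spanType are set here; -gui opens the GUI for everything else.
MIXED_CMD="java -Xmx500M edu.cmu.minorthird.ui.TrainClassifier -gui -labels sample3.train -spanType fun"

# Print the command so it can be checked; paste it at the prompt to run it.
echo "$MIXED_CMD"
```

Any parameter given this way should show up already filled in when the Property Editor is opened, so only the remaining fields need attention there.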
To view a list of parameters and their functions run:
$ java -Xmx500M edu.cmu.minorthird.ui.TrainClassifier -help
or
$ java -Xmx500M edu.cmu.minorthird.ui.TrainClassifier -gui
Click on the Parameters button next to Help, or click on the ? button next to each field in the Property Editor, to see what it is used for. If you are using the GUI, click the Edit button next to TrainClassifier. A Property Editor window will appear:

There are four groups of parameters to specify for this experiment. The only required parameters are labelsFilename (-labels) and either spanType or spanProp.
- baseParameters contains the options for loading the collection of documents.
  - GUI: enter sample3.train in the labelsFilename text field.
  - Command Line: use the -labels option followed by the repository key or the directory of files to load. For this tutorial specify -labels sample3.train.
- saveParameters contains one parameter for specifying a file to save the result to. Saving is optional, but useful if the resulting classifier will be used in TestClassifier and ApplyAnnotator experiments.
  - GUI: type sample3.ann in the saveAs text field.
  - Command Line: -saveAs sample3.ann
- signalParameters: either spanType or spanProp must be specified as the type to learn. For this experiment we will use the spanType fun.
  - GUI: click the Edit button next to signalParameters. Select fun from the pull-down menu next to spanType.
  - Command Line: -spanType fun
- additionalParameters contains parameters for specifying learning options, most importantly the learner used. We will use the default learner, NaiveBayes, for this experiment, but feel free to change the learner for future experiments.
  - GUI: change the learner by selecting a new learner from the pull-down menu.
  - Command Line: selecting a different learner (or any other class) on the command line can be tricky. The full class must be specified. To get more information on learner classes, look at the API Javadoc. Most learners may be specified on the command line like this: -learner "new Recommended.LEARNER_NAME()". Check the API Javadoc for possible initialization parameters.
- Feel free to try changing any of the other parameters, including the ones in advanced options.
  - GUI: click on the help buttons to get a feeling for what each parameter does and how changing it may affect your results. Once all the parameters are set, click the OK button on the Property Editor.
  - Command Line: add other parameters to the command line (use the -help option to see other parameter options). If an option can be set in the GUI but has no specific parameter listed in the help output, the -other option may be used. To see how to use this option, look at the Command Line Other Option Tutorial.
- If you are using the GUI, once you have finished editing parameters, save the parameter modifications by clicking the OK button on the Property Editor.
- GUI: press the Show Labels button if you would like to view the input data for the classification task.
- Command Line: add -showLabels to the command line.
- Opening the result window:
  - GUI: press Start Task under Execution Controls to run the experiment. How long the task takes depends on the size of the data set and on the learner and splitter you choose. When the task is finished, the error rates will appear in the output text area along with the total time it took to run the experiment.
  - Command Line: specify -showResult to see the graphical result. If this option is not set, only the basic statistics of the task will be shown.
- Once the experiment is completed, click the View Results button in the Execution Controls section to see detailed results in the GUI. If the -showResult option was given on the command line, the window will appear automatically. The resulting classifier will appear:
The features in the extractor may be sorted by name (as shown above), by weight, or by absolute weight, or may be viewed in a tree where the root contains the highest value of the leaves below. Features with the largest weights are most highly correlated with the specified spanType.
- Press the Clear Window button to clear all output from the output and error messages window.
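Putting the command-line options from the steps above together, a complete invocation for this experiment might look like the following sketch. It assumes Minorthird is on the Java classpath and keeps the default learner. The variable names (LABELS, SPAN_TYPE, SAVE_AS, CMD) are hypothetical and only there to make each option's role visible.

```shell
# Sketch only: assumes Minorthird is on the Java classpath.
# Variable names are hypothetical; the option values come from this tutorial.
LABELS="sample3.train"   # -labels: repository key or directory of labeled documents
SPAN_TYPE="fun"          # -spanType: the type to learn
SAVE_AS="sample3.ann"    # -saveAs: file the learned annotator is saved to

# -showResult opens the graphical result window when the task finishes.
CMD="java -Xmx500M edu.cmu.minorthird.ui.TrainClassifier -labels $LABELS -spanType $SPAN_TYPE -saveAs $SAVE_AS -showResult"

# Print the assembled command for inspection; paste it at the prompt to run it.
echo "$CMD"
```

Adding -showLabels to the same line would also display the input data, and leaving out -showResult would limit the output to the basic statistics.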