Mixup Tutorial
Mixup is a simple pattern-matching and information extraction language included with MinorThird. The name's an acronym for My Information eXtraction and Understanding Package. You can run a Mixup program in MinorThird using the UI package (which will be covered in the next section). For more information on the language, see Mixup Language. Also, it may be helpful to look at the Javadoc for Mixup and MixupProgram.
MinorThird's language for manipulating text is Mixup. Some sample Mixup programs can be found on William's website under "Teaching" under "June 21,23,25, 2005". Here is a sample program sample1.mixup with commentary:
defSpanType source1 = title: ... '(' [ ... ] ')' ;
Here defSpanType source1 defines source1 as a span type; the expression to the right of the equals sign is the pattern that a span must match to be labeled source1. This line says that source1 is the text in the title between the parentheses. Here is what each part of the expression means:
- defSpanType - keyword
- source1 - name of the defined SpanType
- title: - start with title and match to the pattern defined in the remainder of the expression
- ... - anything
- '(' - the left parenthesis token
- [ - START
- ... - anything
- ] - END
- ')' - the right parenthesis token
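As a rough illustration of what this pattern extracts, here is a minimal Python sketch that operates on a list of tokens. It is not how MinorThird implements Mixup. It assumes that the pattern has to cover the whole title (so the final token must be a right parenthesis) and that ... can match any run of tokens, including an empty one. The function name and the sample title are made up for this example.

```python
def extract_parenthesized(tokens):
    """Return every token run that could fill [ ... ] in the pattern
    ... '(' [ ... ] ')' when matched against the whole token list."""
    spans = []
    # Assumption: the pattern must cover the entire title, so the
    # last token has to be the right parenthesis.
    if not tokens or tokens[-1] != ")":
        return spans
    # Every left parenthesis before the final token is a possible
    # position for '(' ; the bracketed span is everything between
    # it and the closing ')'.
    for i, tok in enumerate(tokens[:-1]):
        if tok == "(":
            spans.append(tokens[i + 1:-1])
    return spans


# Hypothetical tokenized title, invented for illustration.
title = ["Stocks", "rally", "(", "Reuters", ")"]
print(extract_parenthesized(title))  # [['Reuters']]
```

A title that does not end in a right parenthesis yields no source1 span under these assumptions.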
defSpanType source2 = description: [ !'-'+R ] '-' ... ;
This line is very similar to the line above, but contains a few new expressions:
- ! - not this token
- + - 1+ times
- R - extend to the right
To see the parameters for running a Mixup program, type:
$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -help
Now let's try running a sample Mixup program. To do this, make sure the sample programs are in your minorthird/lib/mixup directory. Do:
$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample1.mixup -showResult
The -showResult parameter will graphically display the output. Or do:
$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample1.mixup -gui
Press the Start Task button to run the program. When the program is done running, a window like this will appear:

This window looks similar to the one that appeared when you ran ViewLabels; however, you will notice that there are now 6 span types instead of 4 since sample1.mixup defined two more span types: source1 and source2. To see what the Mixup program extracted, try going to the SpanTypes tab and highlighting source1 and source2.
sample1a.mixup demonstrates what happens if a Mixup expression contains + instead of +R. Unlike other languages which extend patterns greedily, Mixup takes each pattern literally and backtracks as needed. To see how this works run:
$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample1a.mixup -showResult -saveAs foo.labels
Note: -saveAs FILE saves the labels to FILE in a computer-readable format.
When the window appears, highlight source2. Knowing that source2 is any prefix that ends before a -, you can see that this does not work correctly. Now try running sample1.mixup again to see that it works correctly with +R rather than just +.
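The difference between + and +R can be sketched in Python. This is only an analogy, not MinorThird code, and it does not reproduce sample1a.mixup, whose contents are not shown here. It assumes that a plain + may stop after any number of matching tokens (so several matches are possible), while +R only accepts a run that cannot be extended further to the right. The function names and the sample tokens are hypothetical.

```python
def is_not_dash(tok):
    """Token test corresponding to !'-' (any token except a dash)."""
    return tok != "-"


def prefix_runs(tokens, pred):
    """Plain +: every non-empty prefix whose tokens all satisfy pred."""
    runs = []
    for end in range(1, len(tokens) + 1):
        if not pred(tokens[end - 1]):
            break
        runs.append(tokens[:end])
    return runs


def maximal_prefix_run(tokens, pred):
    """+R: only the prefix that cannot be extended to the right."""
    # The longest run is the last one found; the slice keeps the
    # result a list, and empty when nothing matches.
    return prefix_runs(tokens, pred)[-1:]


# Hypothetical tokenized description, invented for illustration.
description = ["AP", "News", "-", "Markets", "rose"]
print(prefix_runs(description, is_not_dash))         # [['AP'], ['AP', 'News']]
print(maximal_prefix_run(description, is_not_dash))  # [['AP', 'News']]
```

Under these assumptions, the plain + version offers more candidate spans than intended, while +R leaves exactly one.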
The lessons from these two sample mixup programs are:
- Use L and R prefixes when you can.
- Use non-determinism when you need to.
Take a look at another example, sample2.mixup. Then run:
$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample2.mixup -showResult
Now let's take a look at some annotators:
- Open sample3.mixup (don't look at it yet).
- Run (this will take a while):
  $ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample3.mixup -showResult
- Now take a look at sample3.mixup:
  - require asks for a type of annotation to be defined; it is similar to an import statement in Java.
  - Annotators are usually found in $MINORTHIRD/lib/mixup.
  - Annotators can be re-defined in annotators.config, which is usually in $MINORTHIRD/config/annotators.config.
- When RunMixup is finished running, save the computation to save time later on. To do this, click the SaveAs button at the bottom middle of the top left window (you will have to scroll to get there). Note: File->Save As does not work in this case; that is only for serializable objects.
- Now pick out some useful tags and save them in small-newsdir.labels:
$ perl -ane 'print if $F[4]=~/(description|year|title|source|pubDate|link|extracted|contentArea|body|NP|Name|NNP)/' sample3.labels | grep addToType | cut -d" " -f5 | sort | uniq -c
$ perl -ane 'print if $F[4]=~/(description|year|title|source|pubDate|link|extracted|contentArea|body|NP|Name|NNP)/' sample3.labels > small-newsdir.labels
$ java -Xmx500M edu.cmu.minorthird.ui.ViewLabels -labels small-newsdir
Note: the order in which MinorThird searches for labels given -labels FOO is as follows:
- Look in the repository.
- Look for directory FOO.
- Look for FOO.labels for markup, and ignore in-line markup.
- Look for in-line markup.
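The perl filter used earlier to pick out useful tags from sample3.labels can also be sketched in Python, which may be easier to adapt. The sketch relies only on what those commands imply: each line is split on whitespace, the fifth field holds the type name, and a line is kept when that field contains one of the listed names anywhere in it (the regular expression is not anchored). The function names and the sample lines in the usage note are invented for illustration and are not taken from a real labels file.

```python
import re
from collections import Counter

# Same alternation as in the perl one-liner; unanchored, so a type
# name only needs to contain one of these strings.
KEEP = re.compile(
    r"(description|year|title|source|pubDate|link|extracted|"
    r"contentArea|body|NP|Name|NNP)"
)


def keep_line(line):
    """True if the fifth whitespace-separated field matches KEEP."""
    fields = line.split()
    return len(fields) > 4 and KEEP.search(fields[4]) is not None


def filter_labels(lines):
    """Equivalent of the perl filter: keep only the matching lines."""
    return [line for line in lines if keep_line(line)]


def count_types(lines):
    """Equivalent of the grep | cut | sort | uniq -c tally."""
    return Counter(
        line.split()[4]
        for line in filter_labels(lines)
        if "addToType" in line
    )
```

For example, given the made-up lines "addToType d1 0 5 title" and "addToType d1 6 3 VBZ", filter_labels keeps only the first, and count_types reports one occurrence of title. Lines with fewer than five fields are dropped, as they are by the perl filter.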
Debugging Mixup gives you the ability to edit your labels and your labeling program in parallel. To see how this works, copy saved-handLabeled.labels to handLabeled.labels and do:
$ java -Xmx500M edu.cmu.minorthird.ui.DebugMixup -labels small-newsdir -edit handLabeled.labels -mixup sample5.mixup
A window that looks like this will appear (without the highlighting at first):

To highlight extracted companies (which were defined by the Mixup program), select extracted_company from the first pull-down menu on the section divider. All the extracted companies will turn yellow (you may have to scroll down a little to find any). Then, to view the true companies, which were defined by handLabeled.labels, select true_company from the second pull-down menu. All hand-labeled companies that were properly extracted by the Mixup program will turn green, all companies that were missed by the Mixup program will turn blue, and false positives will turn red (see above picture for reference).
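The color scheme amounts to comparing two sets of spans. The Python sketch below shows that comparison; it is an illustration rather than the tool's actual logic. It assumes that a span counts as properly extracted only when it matches a hand-labeled span exactly, which may differ from how DebugMixup treats partial overlaps. The function name and the (document, start, end) tuple representation are hypothetical.

```python
def color_spans(extracted, true):
    """Map each span to the color described above.

    extracted: spans produced by the Mixup program (extracted_company)
    true:      hand-labeled spans (true_company)
    """
    extracted, true = set(extracted), set(true)
    colors = {}
    for span in extracted & true:
        colors[span] = "green"  # hand-labeled and extracted
    for span in true - extracted:
        colors[span] = "blue"   # hand-labeled but missed
    for span in extracted - true:
        colors[span] = "red"    # extracted but not hand-labeled
    return colors


# Hypothetical spans as (document, start, end), invented for illustration.
extracted = [("doc1", 0, 9), ("doc1", 40, 52)]
true = [("doc1", 0, 9), ("doc1", 70, 81)]
for span, color in sorted(color_spans(extracted, true).items()):
    print(span, color)
```

In this example the span at 0-9 comes out green, the span at 40-52 red, and the span at 70-81 blue.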
To edit the labels, click on a document, and click the Import button at the bottom of the window. This will import all the extracted company labels. To correct these labels, click the Next button, and click Delete if a label is a false positive. To add a label, highlight the span and click Add. When you are finished labeling a document, click Export. Click Save when you finish.
Some tips:
- On RHS of the center bar, replace -top- with -body- to focus the window on what you care about.
- Replace -top- with -extracted company- and move the slider to look for extractions-in-context.
When you're close enough with the debugging, you might want to hand the task over to someone else to get more training data. First run the current program:
$ java -Xmx500M edu.cmu.minorthird.ui.RunMixup -labels small-newsdir -mixup sample5.mixup -saveAs sample5.labels
Now take the relevant part of its output, and your hand-labeling results, and merge them:
$ grep extracted_company sample5.labels > labelingTask.labels
$ cat handLabeled.labels >> labelingTask.labels
Now run the labeling tool (which is somewhat stripped down) on the result:
$ java -Xmx500M edu.cmu.minorthird.ui.EditLabels -labels small-newsdir -edit labelingTask.labels -extractedType extracted_company -trueType true_company