Automatic Mapping Among Lexico-Grammatical Annotation Models (AMALGAM)
Preparing the input corpora for training Brill's tagger is non-trivial. To give an indication of the work involved, consider the Brown corpus. It is all in upper case (although, luckily, an asterisk has been placed in front of characters that really should be upper case). If a lexicon were extracted directly from the Brown corpus, it could not be used to tag normal raw text, so a program is required to convert the text of Brown to lower case. Next, Brown uses 'combined' tags for words like 'won't', whereas most other corpora split combined words into their constituent parts. For consistency, Brown's combined words and their associated tags need to be split into constituent parts. Finally, most corpora are formatted vertically, with one word per line; that line typically also carries the tag for the word and some reference information. The Brill tagger takes horizontal format, so the file needs to be reformatted horizontally before training can take place.
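As a rough illustration of the case conversion described above, the following Python sketch lower-cases an all-capitals line and keeps a capital wherever an asterisk precedes a character. The function name `restore_case` and the assumption that an asterisk marks exactly the next character are illustrative only; they are not taken from the Brown corpus documentation.

```python
def restore_case(line: str) -> str:
    """Lower-case an all-capitals line, keeping capitals marked by '*'.

    Assumption: a '*' means "the next character really is upper case";
    the asterisk itself is dropped from the output.
    """
    out = []
    keep_upper = False
    for ch in line:
        if ch == "*":
            keep_upper = True          # flag the following character
            continue
        out.append(ch.upper() if keep_upper else ch.lower())
        keep_upper = False
    return "".join(out)


print(restore_case("*THE CAT SAT ON *JOHN'S MAT ."))
# prints: The cat sat on John's mat .
```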
The procedure for training Brill's tagger with a new scheme, once the input corpus has been formatted to a standard as indicated in the previous paragraph, is as follows:
Training Procedure
The following describes the process by which Brill's tagger can be trained to learn the tagging scheme of a tagged corpus. Many files are created during the training process and these are represented in bold type. Only five files are needed by the tagger when training is complete. These essential files are always represented by BOLD CAPITALS. The non-essential files can get large and should be deleted when training is complete. Brill's tagger has its own documentation which should be used in conjunction with this.
Our initial resource is the original tagged corpus. We want to re-use the scheme used to tag the corpus by training Brill's tagger.
Procedure for training
1. As stated above, the tagged training corpus (assumed to be called 'training-corpus' in the following) needs to be in horizontal format, that is, with each tag appended to its word by a "/" character. For example,
The/det cat/noun sat/verb on/prep the/det mat/noun ./.
'vert2upenn.prl' is the Perl code I have written to do the format conversion.
Brill's tagger assumes that each "word" and each "tag" is a single string of characters without spaces. vert2upenn.prl therefore converts all spaces found in words or tags to the underscore ('_') character. Also, Brill's tagger assumes that words and tags are delimited by the slash ('/') character, so any occurrence of that character within a word or tag is replaced by the backslash character ('\').
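The conversion in step 1 can be sketched in Python as follows. This is not vert2upenn.prl itself. The input layout is an assumption made for illustration: word and tag separated by a tab, any further tab-separated fields ignored as reference information, and a blank line ending each sentence. The names `clean` and `vertical_to_horizontal` are hypothetical.

```python
def clean(field: str) -> str:
    """Make a word or tag safe for the word/tag format."""
    # Spaces become underscores; a literal slash becomes a backslash,
    # because '/' is reserved as the word/tag delimiter.
    return field.strip().replace(" ", "_").replace("/", "\\")


def vertical_to_horizontal(lines):
    """Yield one 'word/tag word/tag ...' string per sentence.

    Assumed (hypothetical) input: 'word<TAB>tag[<TAB>reference...]' per
    line, with a blank line marking the end of a sentence.
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield " ".join(sentence)
                sentence = []
            continue
        word, tag = line.split("\t")[:2]
        sentence.append(f"{clean(word)}/{clean(tag)}")
    if sentence:                       # flush the final sentence
        yield " ".join(sentence)


vertical = ["The\tdet", "cat\tnoun", "sat\tverb", "",
            "and/or\tconj", "New York\tnoun"]
for horizontal in vertical_to_horizontal(vertical):
    print(horizontal)
```

With this sample input the second sentence shows both substitutions: 'and/or' becomes 'and\or' and 'New York' becomes 'New_York'.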
2. We need two training sets. The first will be used to extract a lexicon and to derive rules for tagging words not found within it. The second will be used to learn the re-write "patches". A Perl program to split the input training data in two, divide-in-two-rand.prl, can be found in the Utilities subdirectory of the directory where Brill's tagger and associated software are located (~john/corpora/bin/Brill_RBTagger_V1.14). All utility programs mentioned hereafter in the training procedure assume this same path. Typical sizes are 250,000 words for learning the rules for unknown words (it will take about three days) and 500,000 words for learning the re-write rules (it will take about one day). In the command below, the two training sets are called training-corpus-1 and training-corpus-2.
UNIX% cat training-corpus | divide-in-two-rand.prl training-corpus-1 training-corpus-2
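The idea of a random two-way split can be sketched in Python as below. This is only an illustration of the technique and makes no claim about how divide-in-two-rand.prl is implemented. The `fraction` and `seed` parameters are hypothetical additions that control the approximate share sent to the first set and make the split repeatable.

```python
import random


def divide_in_two(sentences, fraction=0.5, seed=0):
    """Randomly assign each sentence (one line of word/tag pairs) to one of two sets.

    `fraction` is the approximate share sent to the first set, and `seed`
    makes the split repeatable; both are hypothetical parameters.
    """
    rng = random.Random(seed)
    first, second = [], []
    for sentence in sentences:
        (first if rng.random() < fraction else second).append(sentence)
    return first, second


corpus = [f"word{i}/tag" for i in range(10)]
part1, part2 = divide_in_two(corpus)
# Every sentence lands in exactly one of the two sets.
print(len(part1) + len(part2) == len(corpus))
```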
3. Now learn the rules to predict the most likely tag for unknown words. This process uses training-corpus-1 and all available untagged text. To convert the Brill-style tagged corpus into raw (untagged) training data, use the program Utilities/tagged-to-untagged.prl:
UNIX% cat training-corpus | tagged-to-untagged.prl > untagged-corpus
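What stripping the tags amounts to can be shown with a short Python sketch. It assumes the horizontal word/tag format from step 1 and is a hypothetical stand-in, not the source of tagged-to-untagged.prl.

```python
def untag(tagged_line: str) -> str:
    """Remove the '/tag' suffix from every token on a line."""
    words = []
    for token in tagged_line.split():
        # Split on the last slash only, so a token such as './.' keeps
        # its word part. Tokens with no slash are passed through as-is.
        word, sep, _tag = token.rpartition("/")
        words.append(word if sep else token)
    return " ".join(words)


print(untag("The/det cat/noun sat/verb on/prep the/det mat/noun ./."))
# prints: The cat sat on the mat .
```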
4. Create BIGWORDLIST, a list of all words occurring in untagged-corpus, sorted by decreasing frequency. Use the command:
UNIX% cat untagged-corpus | Utilities/wordlist-make.prl | sort -k2 -rn | awk '{ print $1}' > BIGWORDLIST
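For readers who want to see what the pipeline in step 4 computes, here is a minimal Python sketch of a frequency-ordered word list. The function name `big_word_list` is hypothetical, and the order of words with equal counts may differ from what the shell pipeline produces.

```python
from collections import Counter


def big_word_list(untagged_lines):
    """Return the distinct words, most frequent first."""
    counts = Counter(word for line in untagged_lines for word in line.split())
    # most_common() orders entries by decreasing count.
    return [word for word, _count in counts.most_common()]


for word in big_word_list(["the cat sat on the mat", "the dog sat"]):
    print(word)
```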
5. Create smallwordtaglist, a list of the number of times each word is tagged with each tag in training-corpus-1. Again, it is sorted by decreasing frequency. Use the command:
UNIX% cat training-corpus-1 | Utilities/word-tag-count.prl | sort -k3 -rn > smallwordtaglist
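The counting in step 5 can be sketched in Python as follows. The three-column result (word, tag, count) is inferred from the sort on the third field in the command above; the function name `word_tag_counts` is hypothetical.

```python
from collections import Counter


def word_tag_counts(tagged_lines):
    """Count (word, tag) pairs and return them most frequent first."""
    counts = Counter()
    for line in tagged_lines:
        for token in line.split():
            word, sep, tag = token.rpartition("/")
            if sep:                    # skip tokens that carry no tag
                counts[(word, tag)] += 1
    return [(w, t, n) for (w, t), n in counts.most_common()]


sample = ["the/det run/noun", "the/det run/verb run/verb"]
for word, tag, n in word_tag_counts(sample):
    print(word, tag, n)
```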
6. Create BIGBIGRAMLIST, a list of word pairs occurring in untagged-corpus, by using the command:
UNIX% cat untagged-corpus | Utilities/bigram-generate.prl | awk '{ print $1,$2}' > BIGBIGRAMLIST
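A list of adjacent word pairs can be sketched in Python as below. Two points are assumptions made for the sketch rather than documented behaviour of bigram-generate.prl: pairs do not cross line boundaries, and each distinct pair is listed once.

```python
def bigram_list(untagged_lines):
    """Return each distinct pair of adjacent words, in first-seen order."""
    seen = {}
    for line in untagged_lines:
        words = line.split()
        for left, right in zip(words, words[1:]):
            seen.setdefault((left, right), None)   # dict keeps insertion order
    return list(seen)


for left, right in bigram_list(["the cat sat", "the cat ran"]):
    print(left, right)
# prints:
# the cat
# cat sat
# cat ran
```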
7. Now the input files needed by the learner of rules for new words have been created. To run the lexical rule learner, type:
UNIX% Learner_Code/unknown-lexical-learn.prl BIGWORDLIST smallwordtaglist BIGBIGRAMLIST 300 LEXRULEOUTFILE
The 300 means to only use bigram contexts where at least one of the two words is amongst the 300 most frequent words. This value can be changed if desired.
This program is the slowest of the programs to run. As it exhaustively checks many thousands of candidate rules at each iteration it can take weeks to learn around a hundred rules.
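The effect of the threshold of 300 can be illustrated with a small Python sketch. It shows only the filtering condition stated above, not how the learner itself uses the bigrams; the function name `filter_bigrams` and the parameter `top_n` are hypothetical.

```python
def filter_bigrams(bigrams, words_by_frequency, top_n=300):
    """Keep bigrams in which at least one word is among the top_n words.

    `words_by_frequency` is assumed to be sorted most frequent first,
    as BIGWORDLIST is.
    """
    frequent = set(words_by_frequency[:top_n])
    return [(a, b) for a, b in bigrams if a in frequent or b in frequent]


words = ["the", "of", "cat", "aardvark"]
pairs = [("the", "cat"), ("cat", "aardvark"), ("aardvark", "of")]
print(filter_bigrams(pairs, words, top_n=2))
# prints: [('the', 'cat'), ('aardvark', 'of')]
```

A smaller threshold leaves fewer bigram contexts for the learner to consider.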
8. Create the file training.lexicon, where each word is listed with the tags it was found with in training-corpus-1. No frequency information is retained except that the most frequent tag is listed first. The command to do this is:
UNIX% cat training-corpus-1 | Utilities/make-restricted-lexicon.prl > training.lexicon
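The lexicon described in step 8 can be sketched in Python as follows: each word maps to the tags seen with it, ordered so that the most frequent tag comes first, and the counts are then discarded. The function name `make_lexicon` and the printed layout (word followed by its tags) are illustrative assumptions, not the documented file format.

```python
from collections import Counter, defaultdict


def make_lexicon(tagged_lines):
    """Map each word to its observed tags, most frequent tag first."""
    counts = defaultdict(Counter)
    for line in tagged_lines:
        for token in line.split():
            word, sep, tag = token.rpartition("/")
            if sep:                    # skip tokens that carry no tag
                counts[word][tag] += 1
    # Only the ordering survives; the counts themselves are discarded.
    return {w: [t for t, _n in c.most_common()] for w, c in counts.items()}


lexicon = make_lexicon(["the/det run/noun", "the/det run/verb run/verb"])
for word, tags in lexicon.items():
    print(word, " ".join(tags))
# prints:
# the det
# run verb noun
```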
9. To make the lexicon that will be used in the final trained tagger, give this command:
UNIX% cat training-corpus | Utilities/make-restricted-lexicon.prl > FINAL.LEXICON
10. Now convert the second half of the tagged corpus into raw text. Use:
UNIX% cat training-corpus-2 | Utilities/tagged-to-untagged.prl > untagged-corpus-2
11. Next, we use Brill's tagger to produce a dummy-tagged-corpus based on the rules acquired so far. The tagger must be run from the directory that contains it. Use the command:
UNIX% tagger training.lexicon untagged-corpus-2 BIGBIGRAMLIST LEXRULEOUTFILE /dev/null -w BIGWORDLIST -i dummy-tagged-corpus > /dev/null
12. To learn the contextual rules, use:
UNIX% Bin_and_Data/contextual-rule-learn training-corpus-2 dummy-tagged-corpus CONTEXT-RULEFILE training.lexicon
This stage can also take a long time to execute. It may take a week to learn a sufficient number of patching rules.
Brill's tagger has now been trained and can be used to apply the learned annotation scheme to raw text.
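Once training is complete, a small check such as the Python sketch below can confirm that the five essential files exist and list the intermediate files that are candidates for deletion. The file names are those used in the procedure above; the function `check_training_output` is hypothetical and assumes all files were written to a single directory.

```python
from pathlib import Path

# The five files the trained tagger needs.
ESSENTIAL = ["BIGWORDLIST", "BIGBIGRAMLIST", "LEXRULEOUTFILE",
             "FINAL.LEXICON", "CONTEXT-RULEFILE"]
# Files created along the way that are no longer needed afterwards.
INTERMEDIATE = ["training-corpus-1", "training-corpus-2", "untagged-corpus",
                "smallwordtaglist", "training.lexicon", "untagged-corpus-2",
                "dummy-tagged-corpus"]


def check_training_output(directory="."):
    """Return (missing essential files, intermediate files still present)."""
    root = Path(directory)
    missing = [name for name in ESSENTIAL if not (root / name).is_file()]
    leftover = [name for name in INTERMEDIATE if (root / name).is_file()]
    return missing, leftover


missing, leftover = check_training_output(".")
print("missing essential files:", missing)
print("intermediate files that could be deleted:", leftover)
```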
This site developed and maintained by Eric Atwell (eric@comp.leeds.ac.uk)