Automatic Mapping Among Lexico-Grammatical Annotation Models (AMALGAM)


School of Computer Studies


Tokenisation rules for AMALGAM's multi-tagger

The multi-tagger requires the tokeniser if the text to be annotated has not already been tokenised. Ideally tokenisation should be done by hand, or at the very least the tokeniser's output should be verified by a human reader. The tokeniser does the following:

  • Produces output with one sentence per line by splitting at semi-colons, colons, exclamation marks, question marks and full stops (although see the next rule for problems recognising full stops). Blank lines in the input stream are ignored.
  • Splits full stops from the ends of words. Abbreviations cause problems here. If an abbreviation is recognised (by searching through a list compiled from several corpora), the full stop is left attached to the word. An unrecognised abbreviation may therefore be treated as the last word of a sentence, so that one sentence is incorrectly split into two. Some morphological rules are applied to recognise abbreviations not previously seen. Acronyms that alternate in sequence between an alphanumeric character and a full stop (for example B.B.C., Japan-U.S. and i.e.) keep the final full stop, as do words made up of at least one number followed by a recognised measurement (for example 49ft. and 320-yd.).
  • A further complication arises when a word is both an abbreviation and the last word of a sentence. The tokeniser detects this by looking at the next word. If it is capitalised, the current abbreviation is assumed to be the last word of the current sentence and the following capitalised word is taken to be the first word of the next sentence. However, titular abbreviations such as Mr. and Cmdr. are normally followed by a capitalised word, and a titular abbreviation is almost never the last word in a sentence. The tokeniser therefore checks whether the current word is in the list of titular abbreviations and, if it is, suppresses the "start a new sentence if the word following an abbreviation is capitalised" rule.
  • Converts words at the start of a sentence to lower case unless they are always capitalised (this is determined by looking them up in a lexicon). This rule is not infallible.
  • Splits combined words into their constituents: I've becomes I + 've and shan't becomes shall + n't. Some morphological rules are applied to deal with unrecognised combined words, such as always splitting the n't off a word. However, there are bound to be rare or slang combinations that the tokeniser will not find, and these should be split by hand.
  • Generally, the taggers expect 's endings to be split off when the 's is part of a contraction but not when the 's is acting as a genitive marker. Normally a contracted 's stands for is or has, but it can also stand for as (in contractions such as well's or soon's) or for us (in let's). Deciding automatically whether 's is a contraction or a genitive marker can be difficult. A guess is made by looking at the next word. If it is recognised as a word that can follow is (usually a verb, adverb or preposition), the 's is assumed to be part of a contraction and is stripped from the word. The wordlist used was formed by extracting words from the Brown lexicon that had been tagged with a vetted subset of tags.
  • Quotes are also problematic, as the quote character on a word like Jones' could be a genitive marker or part of a quoted expression. The tokeniser counts the number of opening and closing quotes in the current sentence. If there are more opening quotes, the ' character is assumed to be a quote character and is split from the word. Otherwise it is left attached. The double quote character is always split from the word. Normally all quotes are removed from the start of a word. However, some words begin with an apostrophe, such as 'ello. A list of such words is kept, and they do not have the opening quote removed.
  • Splits off other punctuation such as parentheses.
  • Some of these rules may apply in combination. For example, "Don't!") needs to be split into six items, " + Do + n't + ! + " + ), and the capital D needs to be converted to lower case.
  • Already tokenised data will be left as unaltered as possible. However, this cannot be guaranteed, so if the AMALGAM multi-tagger is to be used on already tokenised data, the data should not be passed through the tokeniser.
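
The sentence-splitting, full-stop, titular-abbreviation, combined-word and 's rules above can be illustrated with a minimal Python sketch. This is not AMALGAM's own code: the function names, the regular expressions and the tiny word lists (ABBREVIATIONS, TITLES, CONTRACTIONS, FOLLOWS_IS) are hypothetical stand-ins for the much larger corpus-derived lists the rules refer to. Lower-casing and quote handling are left out.

```python
import re

# Hypothetical stand-ins for the corpus-derived lists described in the rules.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "cmdr.", "etc.", "co."}
TITLES = {"mr.", "mrs.", "dr.", "cmdr."}
CONTRACTIONS = {"shan't": ["shall", "n't"]}       # irregular forms need a lookup
FOLLOWS_IS = {"going", "not", "in", "been", "very"}  # words that can follow "is"

# Character/full-stop alternation at the end of a word: B.B.C., Japan-U.S., i.e.
ACRONYM = re.compile(r"(?:[A-Za-z0-9]\.){2,}$")
# A number followed by a recognised unit: 49ft., 320-yd.
MEASURE = re.compile(r"^\d+-?(?:ft|yd|in|lb)\.$", re.IGNORECASE)
# Regular clitics that can always be split off.
CLITIC = re.compile(r"^(.+)('(?:ve|ll|re|d|m))$", re.IGNORECASE)


def keeps_full_stop(word):
    """True if the trailing full stop belongs to the word itself."""
    return (word.lower() in ABBREVIATIONS
            or ACRONYM.search(word) is not None
            or MEASURE.match(word) is not None)


def ends_sentence(word, next_word):
    """Decide whether a sentence boundary follows `word`."""
    if word[-1] in ";:!?":
        return True
    if not word.endswith("."):
        return False
    if not keeps_full_stop(word):
        return True                    # ordinary sentence-final full stop
    if word.lower() in TITLES:
        return False                   # titles precede capitalised names
    return next_word is not None and next_word[0].isupper()


def split_sentences(text):
    """Return one list of whitespace-separated words per sentence."""
    words = text.split()
    sentences, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        next_word = words[i + 1] if i + 1 < len(words) else None
        if ends_sentence(word, next_word):
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def split_token(word):
    """Split a sentence terminator and any contraction off one word."""
    tail = []
    if word[-1] in ";:!?" or (word.endswith(".") and not keeps_full_stop(word)):
        word, tail = word[:-1], [word[-1]]
    if word.lower() in CONTRACTIONS:
        head = CONTRACTIONS[word.lower()]
    elif word.lower().endswith("n't") and len(word) > 3:
        head = [word[:-3], "n't"]      # fallback: always split off n't
    else:
        m = CLITIC.match(word)
        head = [m.group(1), m.group(2)] if m else [word]
    return [part for part in head + tail if part]


def split_s(word, next_word):
    """Guess whether 's is a contraction by looking at the next word."""
    if (word.lower().endswith("'s") and len(word) > 2
            and next_word is not None and next_word.lower() in FOLLOWS_IS):
        return [word[:-2], "'s"]       # treated as a contracted "is"
    return [word]                      # treated as a genitive marker
```

With these toy lists, split_sentences("He met Mr. Smith at the B.B.C. in London. It was 49ft. wide etc. Then I've gone; we shan't stay.") yields four sentences, and split_token("Don't!") returns Do, n't and !. A real tokeniser would also need the quote-counting and lower-casing rules, and far larger word lists, before its output could be trusted without the human check recommended above.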



    This site developed and maintained by Eric Atwell (