Automatic Mapping Among Lexico-Grammatical Annotation Models (AMALGAM)
AMALGAM TAGGER - TOKENISATION





Tokenisation rules for AMALGAM's multi-tagger
The tokeniser is required by the multi-tagger if the text to be annotated
has not already been tokenised (which really ought to be done by hand -
or at the very least the tokenised output should be verified by a human
reader). The tokeniser does the following:
Produces output having one sentence per line by splitting at semi-colons,
colons, exclamation marks, question marks and full stops (although see
the next rule for problems recognising full stops). Blank lines in the
input stream are ignored.
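The sentence-per-line rule above might be sketched as follows. This is only an illustration, not the AMALGAM code: it assumes that the sentence-ending punctuation has already been separated from the words by white space, and the function name one_sentence_per_line is hypothetical.

```python
# Punctuation marks at which the text is split into sentences.
TERMINATORS = {";", ":", "!", "?", "."}


def one_sentence_per_line(lines):
    """Return a list of sentences, one per entry, from a list of input lines."""
    sentences = []
    current = []
    for line in lines:
        # A blank line yields no tokens, so it is ignored automatically.
        for token in line.split():
            current.append(token)
            if token in TERMINATORS:
                sentences.append(" ".join(current))
                current = []
    # Keep any trailing words that were not closed by a terminator.
    if current:
        sentences.append(" ".join(current))
    return sentences
```

A sentence may run across several input lines; it is only closed when one of the listed punctuation marks is met.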
Split full stops from the ends of words. Abbreviations cause problems
here. If an abbreviation is recognised (by searching through a list compiled
from several corpora) the full stop is left attached to the word. An
unrecognised abbreviation may therefore be taken for the last word of a
sentence, so that one sentence is incorrectly split into two. Some
morphological rules are applied to recognise abbreviations not previously
seen. Acronyms that alternate in sequence between alphanumeric character
and full stop (B.B.C., Japan-U.S. and i.e., for example) keep the final
full stop, as do words that contain at least one number and end in a
recognised measurement (for example 49ft. and 320-yd.).
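A minimal sketch of the full stop rule is given here as an illustration only. The two word lists are tiny hypothetical samples standing in for the real lists, the function names are invented, and requiring at least two "character + full stop" pairs in an acronym is an assumption made so that a single letter at the end of a sentence is not mistaken for an acronym.

```python
import re

# Hypothetical sample entries; the real list was compiled from several corpora.
KNOWN_ABBREVIATIONS = {"etc.", "Inc.", "Mr.", "Cmdr."}
# Hypothetical sample of recognised measurement units.
MEASUREMENTS = {"ft", "yd", "in", "lb", "oz"}

# Two or more repetitions of "alphanumeric character + full stop" (B.B.C., i.e.).
ALTERNATING = re.compile(r"(?:[A-Za-z0-9]\.){2,}")
# Digits, an optional hyphen, letters, then a full stop (49ft., 320-yd.).
MEASURE = re.compile(r"\d+-?([A-Za-z]+)\.")


def keeps_full_stop(word):
    """Return True if the final full stop should stay attached to the word."""
    if word in KNOWN_ABBREVIATIONS:
        return True
    # For hyphenated forms such as Japan-U.S., test the part after the last hyphen.
    if ALTERNATING.fullmatch(word.rsplit("-", 1)[-1]):
        return True
    match = MEASURE.fullmatch(word)
    return bool(match) and match.group(1).lower() in MEASUREMENTS


def split_full_stop(word):
    """Split a trailing full stop from the word unless it must be kept."""
    if word.endswith(".") and len(word) > 1 and not keeps_full_stop(word):
        return [word[:-1], "."]
    return [word]
```

An ordinary word such as house. is returned as house and a separate full stop, while B.B.C. and 49ft. come back unchanged.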
A further complication arises when a word is both an abbreviation
and the last word of a sentence. The tokeniser checks for this by examining
the next word. If it is capitalised then the current abbreviated word
is assumed to be the last of the current sentence and the following capitalised
word is taken to be the first word of the next sentence. However, for titular
abbreviations such as Mr. and Cmdr. we would expect the next word to be
capitalised. Further, it is almost certain that a titular abbreviation
is not the last word in the sentence. The tokeniser checks to see if the
current word is in the list of titular abbreviations and, if it is, it
prevents the "start a new sentence if the word following an abbreviation
is capitalised" rule from operating.
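The decision described above can be sketched as a small function. This is an illustration under stated assumptions: the list of titular abbreviations is a hypothetical sample, the name ends_sentence is invented, and treating an abbreviation at the very end of the input as sentence-final is an assumption not stated in the rules.

```python
# Hypothetical sample; the real tokeniser has its own list of titular abbreviations.
TITULAR_ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Cmdr."}


def ends_sentence(abbreviation, next_word):
    """Decide whether a recognised abbreviation is the last word of its sentence.

    next_word is the following word, or None at the end of the input.
    """
    # A title is expected to be followed by a capitalised name,
    # so the capitalisation test is switched off for it.
    if abbreviation in TITULAR_ABBREVIATIONS:
        return False
    # Assumption: nothing follows, so the sentence must end here.
    if next_word is None:
        return True
    # Otherwise a capitalised next word signals the start of a new sentence.
    return next_word[:1].isupper()
```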
Convert words at the start of the sentence to lower case unless the
words are always capitalised (this is determined by looking in a lexicon).
This rule is not failsafe.
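As an illustration, the lower-casing rule could look something like the sketch below. The lexicon is a hypothetical three-word sample and the function name is invented. Two further assumptions are made: leading punctuation tokens are skipped to find the first word, and only the first letter is changed.

```python
# Hypothetical sample of words that the lexicon lists as always capitalised.
ALWAYS_CAPITALISED = {"I", "London", "Monday"}


def lower_first_word(tokens):
    """Return a copy of the sentence with its first word lower-cased if allowed."""
    result = list(tokens)
    for index, token in enumerate(result):
        # Skip leading punctuation such as an opening quote.
        if token[:1].isalpha():
            if token not in ALWAYS_CAPITALISED:
                result[index] = token[0].lower() + token[1:]
            break
    return result
```

A word that is capitalised in only some of its uses will not be in such a lexicon and so will always be lower-cased, which is one way the rule can go wrong.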
Combined words are split into constituents. I've becomes I + 've and
shan't becomes shall + n't. Some morphological rules can be applied to
try to deal with unrecognised combined words such as always splitting off
the n't from a word. However, there are bound to be rare or slang combinations
that will not be found by the tokeniser and these should be split by hand.
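The splitting of combined words might be sketched as a table lookup followed by a fallback suffix rule. This is not the AMALGAM implementation: the table holds only the one irregular form named above, the names are hypothetical, and every suffix other than n't and 've is an assumption. The 's ending is left out because it needs separate treatment.

```python
# Irregular forms whose parts cannot be found by cutting the word in two.
KNOWN_SPLITS = {"shan't": ["shall", "n't"]}
# Suffixes split off by rule; all but n't and 've are assumptions.
FALLBACK_SUFFIXES = ("n't", "'ve", "'ll", "'re", "'d", "'m")


def split_combined(word):
    """Split a combined word into its constituents, or return it unchanged."""
    lowered = word.lower()
    if lowered in KNOWN_SPLITS:
        # The table entry is returned as stored, so any capital letter is lost.
        return list(KNOWN_SPLITS[lowered])
    for suffix in FALLBACK_SUFFIXES:
        if lowered.endswith(suffix) and len(word) > len(suffix):
            return [word[:-len(suffix)], word[-len(suffix):]]
    return [word]
```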
Generally, the taggers expect 's endings to be split off when the 's
is part of a contraction but not when the 's is acting as a genitive marker.
Normally a contracted 's stands for is or has, but it could also stand for as
in contractions such as well's or soon's, or for us in let's. Deciding whether
's is a contraction or a genitive marker is difficult to do automatically.
A guess is made by looking at the next word. If it is recognised as a word
that can follow an is (usually the word will be a verb, adverb or preposition)
the 's is assumed to be part of a contraction and is stripped from the
word. The wordlist was formed by extracting from the Brown lexicon those
words that had been tagged with a vetted subset of tags.
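The next-word guess for 's can be illustrated with the sketch below. The word list is a small hypothetical sample, whereas the real one was derived from the Brown lexicon, and the function name is invented.

```python
# Hypothetical sample of words that can follow "is":
# typically verbs, adverbs and prepositions.
CAN_FOLLOW_IS = {"going", "been", "not", "very", "in", "on"}


def split_apostrophe_s(word, next_word):
    """Strip 's from the word if the next word suggests a contraction.

    next_word is the following word, or None if there is none.
    """
    has_ending = word.lower().endswith("'s") and len(word) > 2
    if has_ending and next_word is not None and next_word.lower() in CAN_FOLLOW_IS:
        return [word[:-2], word[-2:]]
    # Otherwise the 's is taken to be a genitive marker and left attached.
    return [word]
```

Because only the following word is examined, a contraction followed by a word outside the list is left unsplit, so the output still needs checking by hand.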
Quotes are also problematic as the quote character on a word like Jones'
could be a genitive marker or part of a quoted expression. The tokeniser
counts the number of opening and closing quotes in the current sentence.
If there are more opening quotes the ' character is assumed to be a quote
character and is split from the word. Otherwise it is left attached. The
double quote character is always split from the word. Normally all quotes
are removed from the start of a word. However, there are some words that
begin with an apostrophe such as 'ello. There is a list of such words that
will not have an opening quote removed.
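One way to picture the quote counting is the sketch below, which works through the words of a single sentence from left to right. It is an illustration only. The exception list is a hypothetical sample (only 'ello is named above), the function name is invented, and keeping a running count of unmatched opening quotes is an assumption about how the comparison of opening and closing quotes might be carried out.

```python
# Hypothetical sample of words that legitimately begin with an apostrophe.
APOSTROPHE_INITIAL = {"'ello", "'tis"}


def split_quotes(words):
    """Split quote characters from the words of one sentence."""
    output = []
    open_single = 0  # single quotes opened but not yet closed
    for word in words:
        trailing = []
        # Split quotes from the start of the word, unless it is a listed exception.
        while len(word) > 1 and word[0] in "\"'":
            if word[0] == "'":
                if word.lower() in APOSTROPHE_INITIAL:
                    break
                open_single += 1
            output.append(word[0])
            word = word[1:]
        # Split quotes from the end of the word.
        while len(word) > 1 and word[-1] in "\"'":
            if word[-1] == "'":
                if open_single == 0:
                    break  # no unmatched opening quote: leave it attached
                open_single -= 1
            trailing.insert(0, word[-1])
            word = word[:-1]
        output.append(word)
        output.extend(trailing)
    return output
```

With no opening quote earlier in the sentence, Jones' is left whole; after an opening quote such as in 'the, the same word is split into Jones and a closing quote.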
Split off other punctuation such as parentheses.
Some of these rules may apply in combination; for example, "Don't!")
needs to be split into six items: " + Do + n't + ! + " + ), and
the capital D needs to be converted to lower case.
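The combined example can be reproduced with a deliberately simplified sketch that chains three of the rules: peeling punctuation from both ends, lower-casing a sentence-initial word and splitting off n't. The function name and its sentence_initial parameter are hypothetical, and the lexicon, abbreviation and quote-counting checks are all left out.

```python
def tokenise_word(word, sentence_initial=False):
    """Split one white-space delimited item into tokens (simplified)."""
    leading = []
    trailing = []
    # Peel opening quotes and brackets from the front.
    while word and word[0] in '"(':
        leading.append(word[0])
        word = word[1:]
    # Peel closing quotes, brackets and other punctuation from the back.
    while word and word[-1] in '")!?.,;:':
        trailing.insert(0, word[-1])
        word = word[:-1]
    core = []
    if word:
        if sentence_initial:
            # No lexicon lookup here: the first letter is always lower-cased.
            word = word[0].lower() + word[1:]
        if word.lower().endswith("n't") and len(word) > 3:
            core = [word[:-3], word[-3:]]
        else:
            core = [word]
    return leading + core + trailing
```

Called as tokenise_word('"Don\'t!")', sentence_initial=True), this returns the six items listed above with the D in lower case.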
Already tokenised data will be left as unaltered as possible. However,
this cannot be guaranteed, so if the AMALGAM multi-tagger is to be used
on already tokenised data, the data should not be passed through the tokeniser.



This site developed and maintained by Eric Atwell
(eric@comp.leeds.ac.uk)