Morphological analysis, tagging and lemmatization are essential for many
Natural Language Processing (NLP) applications, both practical and
theoretical. Modern taggers and analyzers are very accurate. However,
the standard way to create them for a particular language requires a
substantial amount of expertise, time and money. A tagger is usually
trained on a large corpus (100,000+ words) annotated with correct tags.
Morphological analyzers
usually rely on large manually created lexicons. For example, the Czech
analyzer (Hajic 2004) uses
a lexicon with 300,000+ entries. As a result, most of the world's
languages and dialects have no realistic prospect of acquiring
morphological taggers or analyzers created this way.
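To make the standard supervised paradigm concrete, the following Python
sketch trains a simple tagger on a toy annotated corpus with NLTK. It is
purely illustrative: the two-sentence corpus and the unigram model are
hypothetical stand-ins for the 100,000+ word treebanks and far more
sophisticated models that real taggers rely on.

```python
# Illustrative sketch of the standard supervised paradigm: a tagger is
# trained on a corpus of sentences annotated with correct tags.
# The two-sentence corpus below is a toy stand-in for the 100,000+ word
# annotated corpora that real taggers require.
import nltk

train_sents = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

# A unigram tagger backed off to a default tag; modern taggers use far
# more sophisticated models, but the training setup is the same.
tagger = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger("NOUN"))

print(tagger.tag(["the", "cat", "barks"]))
# [('the', 'DET'), ('cat', 'NOUN'), ('barks', 'VERB')]
```

The point of the sketch is the data requirement: the tagger is only as
good as the annotated corpus it is trained on, which is exactly the
resource most languages lack.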
We have been developing a method for creating morphological taggers and
analyzers of fusional languages without the need for large-scale
knowledge- and labor-intensive resources for the target language.
Instead, we rely on (i) resources available for a related language and
(ii) a limited amount of
high-impact, low-cost manually created resources. This greatly reduces
cost, time requirements and the need for (language-specific) linguistic
expertise.
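As a hedged sketch of how such a combination could work in principle
(this is not the authors' actual system), the fragment below applies a
tagger trained on related-language data to target-language text and lets
a small manually created target-language lexicon override its guesses.
All data and the simple override scheme are hypothetical.

```python
# Purely illustrative: combine (i) a tagger trained on a related source
# language with (ii) a small manually created target-language lexicon.
# The data and the override scheme are hypothetical, not the method
# described in this paper.
import nltk

# (i) Toy data standing in for an annotated corpus of a related language.
source_train = [
    [("kocka", "NOUN"), ("spi", "VERB")],
    [("pes", "NOUN"), ("steka", "VERB")],
]
source_tagger = nltk.UnigramTagger(
    source_train, backoff=nltk.DefaultTagger("NOUN")
)

# (ii) A small, high-impact manual lexicon for the target language.
target_lexicon = {"spit": "VERB"}

def tag_target(tokens):
    # Prefer the manual lexicon; otherwise trust the related-language tagger.
    return [(w, target_lexicon.get(w, t))
            for (w, t) in source_tagger.tag(tokens)]

print(tag_target(["kocka", "spit"]))
# [('kocka', 'NOUN'), ('spit', 'VERB')]
```

The design intuition is that the related language supplies broad (if
imperfect) coverage for free, while the small manual resource corrects
the highest-impact cases at low cost.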