Resource-light Approaches to Morphosyntactic Tagging


Supported by the U.S. National Science Foundation: Grant No. 0916280 and Grant No. 1033275
Also supported by the Grant Agency of the Czech Republic (project ID: P406/10/P328)

Project Description:

Morphological analysis, tagging and lemmatization are essential for many Natural Language Processing (NLP) applications, both practical and theoretical. Modern taggers and analyzers are very accurate. However, the standard way to create them for a particular language requires a substantial amount of expertise, time and money. A tagger is usually trained on a large corpus (100,000 words or more) annotated with correct tags. Morphological analyzers usually rely on large manually created lexicons. For example, the Czech analyzer (Hajic 2004) uses a lexicon with 300,000+ entries. As a result, most of the world's languages and dialects have no realistic prospect of obtaining morphological taggers or analyzers created this way. We have been developing a method for creating morphological taggers and analyzers of fusional languages without the need for large-scale knowledge- and labor-intensive resources for the target language. Instead, we rely on (i) resources available for a related language and (ii) a limited amount of high-impact, low-cost manually created resources. This greatly reduces the cost, the time required and the need for (language-specific) linguistic expertise.

Current and past members:

Anna Feldman, Montclair State University
Jirka Hana, Charles University
Katsiaryna Aharodnik (graduate research assistant, Linguistics)
Liubou Shefarevich (undergraduate research assistant, Linguistics)
Aravind Yeluripati (graduate research assistant, CS)
Kurt Keena (undergraduate research assistant, Linguistics)
Marco Chang (undergraduate research assistant, CS)
Michael Reynolds (undergraduate student assistant)
Katsiaryna Aharodnik (undergraduate student assistant)
Chris Loeschorn (undergraduate research assistant, Computer Science)
John Caserta (undergraduate research assistant, Computer Science)

ESSLLI 2013 Course:
ESSLLI 2010 Course:

Manuals, Tagsets, Tools, Resources Etc.:

Related Publications: