Morphological analysis, tagging and lemmatization are essential for many
Natural Language Processing (NLP) applications, both practical and
theoretical. Modern taggers and analyzers are very accurate. However,
the standard way to create them for a particular language requires a
substantial amount of expertise, time and money. A tagger is usually
trained on a large corpus (100,000+ words) annotated with correct tags.
Morphological analyzers
usually rely on large manually created lexicons. For example, the Czech
analyzer (Hajic 2004) uses
a lexicon with 300,000+ entries. As a result, most of the world's
languages and dialects have no realistic prospect of acquiring
morphological taggers or analyzers created this way.
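To make the standard supervised paradigm concrete, the following Python
sketch trains a simple tagger on a toy annotated corpus with NLTK. It is
purely illustrative: the two-sentence corpus and the unigram model are
hypothetical stand-ins for the 100,000+ word treebanks and far more
sophisticated models that real taggers rely on.

```python
# Illustrative sketch of the standard supervised paradigm: a tagger is
# trained on a corpus of sentences annotated with correct tags.
# The two-sentence corpus below is a toy stand-in for the 100,000+ word
# annotated corpora that real taggers require.
import nltk

train_sents = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

# A unigram tagger backed off to a default tag; modern taggers use far
# more sophisticated models, but the training setup is the same.
tagger = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger("NOUN"))

print(tagger.tag(["the", "cat", "barks"]))
# [('the', 'DET'), ('cat', 'NOUN'), ('barks', 'VERB')]
```

The point of the sketch is the data requirement: the tagger is only as
good as the annotated corpus it is trained on, which is exactly the
resource most languages lack.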
We have been developing a method for creating morphological taggers and
analyzers of fusional languages without the need for large-scale
knowledge- and labor-intensive resources for the target language.
Instead, we rely on (i) resources available for a related language and
(ii) a limited amount of
high-impact, low-cost manually created resources. This greatly reduces
cost, time requirements and the need for (language-specific) linguistic
expertise.
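As a hedged sketch of how such a combination could work in principle
(this is not the authors' actual system), the fragment below applies a
tagger trained on related-language data to target-language text and lets
a small manually created target-language lexicon override its guesses.
All data and the simple override scheme are hypothetical.

```python
# Purely illustrative: combine (i) a tagger trained on a related source
# language with (ii) a small manually created target-language lexicon.
# The data and the override scheme are hypothetical, not the method
# described in this paper.
import nltk

# (i) Toy data standing in for an annotated corpus of a related language.
source_train = [
    [("kocka", "NOUN"), ("spi", "VERB")],
    [("pes", "NOUN"), ("steka", "VERB")],
]
source_tagger = nltk.UnigramTagger(
    source_train, backoff=nltk.DefaultTagger("NOUN")
)

# (ii) A small, high-impact manual lexicon for the target language.
target_lexicon = {"spit": "VERB"}

def tag_target(tokens):
    # Prefer the manual lexicon; otherwise trust the related-language tagger.
    return [(w, target_lexicon.get(w, t))
            for (w, t) in source_tagger.tag(tokens)]

print(tag_target(["kocka", "spit"]))
# [('kocka', 'NOUN'), ('spit', 'VERB')]
```

The design intuition is that the related language supplies broad (if
imperfect) coverage for free, while the small manual resource corrects
the highest-impact cases at low cost.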