Automatic identification of language varieties: The case of Portuguese

Marcos Zampieri, Binyam Gebre; Proceedings of KONVENS 2012 (Main track: poster presentations), pp. 233-237, September 2012.

Abstract

Automatic language identification of written texts is a well-established area of research in Computational Linguistics. State-of-the-art algorithms often rely on character n-gram models to identify the language of a text, with good results reported for European languages. In this paper we propose the use of a character n-gram model and a word n-gram language model for the automatic classification of two written varieties of Portuguese: European and Brazilian. Accuracy reached 0.998 using character 4-grams.
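
The character n-gram approach mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the add-one smoothing, the padding scheme, the names `char_ngrams`, `CharNgramModel` and `classify`, and the four toy training sentences are all assumptions made for illustration, and the toy data says nothing about the accuracy reported in the paper.

```python
import math
from collections import Counter


def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of a lowercased text.

    The text is padded with n-1 leading spaces and one trailing space so
    that word-initial and word-final characters form n-grams too
    (a padding choice assumed here, not taken from the paper).
    """
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Unigram model over character n-grams with add-one smoothing (assumed)."""

    def __init__(self, n=4):
        self.n = n
        self.counts = Counter()
        self.total = 0

    def train(self, texts):
        for text in texts:
            grams = char_ngrams(text, self.n)
            self.counts.update(grams)
            self.total += len(grams)

    def log_prob(self, text, vocab_size):
        # Sum of smoothed log-probabilities of every n-gram in the text.
        denom = self.total + vocab_size
        return sum(
            math.log((self.counts[g] + 1) / denom)
            for g in char_ngrams(text, self.n)
        )


def classify(text, models):
    """Return the label of the model that gives the text the highest score."""
    vocab = set()
    for model in models.values():
        vocab.update(model.counts)
    vocab_size = len(vocab) + 1  # one extra slot for unseen n-grams
    return max(models, key=lambda label: models[label].log_prob(text, vocab_size))


# Hypothetical toy sentences, invented for this sketch (not the paper's corpus).
training = {
    "BR": ["o time registrou o fato no ônibus", "a equipe pegou o trem"],
    "PT": ["a equipa registou o facto no autocarro", "a equipa apanhou o comboio"],
}

models = {}
for label, texts in training.items():
    models[label] = CharNgramModel(n=4)
    models[label].train(texts)

print(classify("autocarro", models))  # PT
print(classify("ônibus", models))     # BR
```

Scoring with log-probabilities avoids numerical underflow on longer texts, and smoothing keeps an unseen 4-gram from driving a variety's score to zero. A realistic experiment would need a large corpus for each variety and a held-out test set.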
