Probabilistic Models for Alignment of Etymological Data

Lecturer: Roman Yangarber
Event type: HIIT seminar
Event time: 2011-04-08 10:15 to 11:00
Kumpula Exactum C222
Etymology is the study of the origins of words and of the
relationships among languages.  It involves
many sub-problems, including finding cognates (sets of
genetically related words) across a language family,
discovering rules of regular sound correspondence among
the languages, building phylogenetic trees, and
reconstructing hidden data, including proto-languages.  We
focus mainly on the regularity of sound correspondence,
but address some of the others as well.  Given a set of
etymological data, our models seek the best alignment at the
sound level.  We aim to
devise methods that are as objective as possible, making
no a priori assumptions (e.g., no preference for
vowel-vowel or consonant-consonant alignments).  One of the
goals is to measure the quality of the data sets, in terms
of their internal consistency.  We introduce an MDL-based
initial model and present several extensions.  We also
discuss several ways of evaluating the results,
qualitatively and quantitatively.  The models are
evaluated on data from the Uralic family (which includes
Finnish, Estonian and Hungarian, among other languages).

(Work done under Academy Project Uralink.
Joint work with Hannes Wettig.)

--Matti Järvisalo

