This is a temporary page set up to make my Austroasiatic lexicostatistics available for distribution and comment.



Lexicostatistics is a much abused and misunderstood technique. In my experience lexicostatistics is quite good at:


Consequently, lexicostatistical analyses can play an important role in interpreting the evidence of comparative phonology and grammar when modeling language family history.


54 AA language Analysis

The data and analysis presented here reflect work I carried out in January 2010. 54 languages, reflecting 13 putative branches, were selected and analysed. Some of the data and cognate scores were taken from Peiros, Ilia J. 2004. Genetičeskaja klassifikacija avstroaziatskix jazykov. Moskva: Rossijskij gosudarstvennyj gumanitarnyj universitet (doktorskaja dissertacija).


Cognate scores were coded according to the table at this link. The data was then processed using Jacques Guy's Glotto software, which is freely available for download. This produced the matrix of percentages, which can be viewed here. The software automatically generates the following dendrogram/Stammbaum.
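To illustrate the step from coded cognate judgements to a matrix of percentages, here is a minimal Python sketch. It is not Glotto's algorithm and does not use the coding table linked above: it simply assumes each language has one integer cognate-class code per meaning (with `None` for a missing entry) and counts, for every pair of languages, the share of comparable meanings that fall in the same class. The language names, meanings and codes are invented for illustration.

```python
from itertools import combinations

# Hypothetical toy data: language -> {meaning: cognate class code}.
# None marks a missing entry. These codes are invented for illustration
# and are not taken from the 54-language data set.
cognate_codes = {
    "LangA": {"hand": 1, "eye": 1, "water": 1, "fire": 2},
    "LangB": {"hand": 1, "eye": 1, "water": 2, "fire": 2},
    "LangC": {"hand": 2, "eye": 1, "water": 3, "fire": None},
}


def shared_cognate_percentage(a, b):
    """Percentage of meanings attested in both lists that share a cognate class."""
    comparable = [m for m in a if a[m] is not None and b.get(m) is not None]
    if not comparable:
        return None  # no overlap, so no score can be computed
    matches = sum(1 for m in comparable if a[m] == b[m])
    return 100.0 * matches / len(comparable)


def percentage_matrix(codes):
    """Pairwise percentages, keyed by alphabetically ordered language pairs."""
    return {
        (x, y): shared_cognate_percentage(codes[x], codes[y])
        for x, y in combinations(sorted(codes), 2)
    }


matrix = percentage_matrix(cognate_codes)
for (x, y), pct in matrix.items():
    label = "n/a" if pct is None else f"{pct:.1f}%"
    print(f"{x} - {y}: {label}")
```

Missing entries are excluded from both the numerator and the denominator here, which is one common convention; other treatments of gaps would give somewhat different percentages.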


The curious fact is that there is an underlying trend for all branches to show higher than expected scores in respect of Katuic and Bahnaric, declining with geographic distance from Katuic and Bahnaric (except for Munda, which shows significant Katuic-Bahnaric isoglosses despite great distance, compared to, say, Nicobarese). This was actually first pointed out by Franklin Huffman in 1978.


My preliminary interpretation, consistent with Huffman, is that Austroasiatic languages dispersed quickly after a long period of being in proximity and contact, centred along the Mekong valley roughly where Katuic and Bahnaric are spoken today. This implies a rather flat family tree emerging from a dialect chain. We can see an indication of this in the neighbour net tree here, created by Russell Gray & Simon Greenhill.


54 AA language Analysis - November 2010 revision

Click here to download the most recent revision of the data/cognate assignments.


Comments and corrections on my data and analyses are invited. Please email to me at


Previous (2009) Analyses

Several earlier trials were conducted. During April-May I worked through data sets of 24, 26, 28 and 30 languages. The data and cognate scores for the 30-language set are at this link; the earlier trials are subsets of it. The neighbour net created from this data is here.

