Categorising languages through network modularity

Today I've been learning more about network structure (from Cris Moore) and I've applied my poor understanding and overconfidence to find language families from etymology data!

Here's what I understand so far (see Clauset, Moore, & Newman, 2008): The modularity of a network measures how well a given split into 'communities' fits the network. It is high when there are more links inside each community, and fewer between communities, than you'd expect by chance. The best clustering is the one that maximises this score. In principle you could search all the possible clusterings to find this optimum, but there are far too many of them for anything but tiny networks, so in practice heuristics are used. I'm still hazy on how this is actually done. You can also extend the idea to find hierarchies, as in phylogenetics, but without some of the assumptions. Luckily, there's a network analysis program called Gephi that does this automatically!
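To make the idea of scoring a split concrete, here is a minimal sketch of the standard Newman-Girvan modularity score for an undirected, unweighted graph. The toy graph, the community labels and the function name `modularity` are made up for illustration. It only scores a partition you hand it; it is not the search procedure that Gephi runs.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Newman-Girvan modularity Q of one partition of an undirected graph."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends inside a community
    degree_sum = defaultdict(int)  # summed degree of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Observed share of internal edges minus the share expected by chance.
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

two_triangles = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
everything_together = {node: 0 for node in "abcdef"}
alternating = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}

for name, split in [("two triangles", two_triangles),
                    ("one big community", everything_together),
                    ("alternating", alternating)]:
    print(name, round(modularity(edges, split), 3))
```

The split that follows the two triangles scores highest, putting everything in one community scores zero, and a split that cuts across the triangles scores below zero.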

A while ago, I generated a graph of word borrowing by looking at an English etymology dictionary. I took about 5,000 words from it and counted the number of links between languages. For instance, the entry for 'pace' is:

"a step," late 13c., from O.Fr. pas, from L. passus "a step," lit. pp. of pandere "to stretch (the leg), spread out," from PIE *pat-no-

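An entry like this can be read off mechanically, since each "from X" marks one step back along the chain. Here is a rough sketch of that step. The abbreviation table, the regular expression and the function name `borrowing_links` are assumptions for illustration rather than the script behind the original counts, and the sketch assumes every "from" continues the chain from the previous language, which holds for this entry but won't hold for every entry.

```python
import re
from collections import Counter

# Hypothetical abbreviation table, just enough for this one entry.
ABBREVIATIONS = {
    "O.Fr.": "Old French",
    "L.": "Latin",
    "PIE": "Proto-Indo-European",
}

# Matches "from <abbreviation>"; longer abbreviations are tried first.
FROM_STEP = re.compile(
    "from ("
    + "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))
    + ")"
)

def borrowing_links(entry, start="Modern English"):
    """Return (borrower, source) pairs for each 'from X' step in an entry."""
    chain = [start] + [ABBREVIATIONS[a] for a in FROM_STEP.findall(entry)]
    return list(zip(chain, chain[1:]))

entry = ('"a step," late 13c., from O.Fr. pas, from L. passus "a step," '
         'lit. pp. of pandere "to stretch (the leg), spread out," from PIE *pat-no-')

links = borrowing_links(entry)
link_counts = Counter(links)  # adding these up over all entries gives edge weights

for borrower, source in links:
    print(borrower, "<-", source)
```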
So I'd add links between Modern English and Old French, between Old French and Latin, and between Latin and PIE (Proto-Indo-European). Nodes are languages (past and present) and each edge runs from a source language to the language that borrowed from it. Given this data, we can categorise the languages based on how they are connected. As I point out in the original post, the network is more graph-like than tree-like: there is more than one route from ancient languages to modern English. Here are some stats:

Scale-free test (KS = 2.6904, p-value < 0.00001)
Nodes: 49
Edges: 182
Transitivity:  0.176
Shortest Path:  1.32
Weighted Shortest Path:  2.15  (weights based on number of words borrowed)
Assortativity:  -0.375

The network is scale-free, suggesting that there are large hubs. Interestingly, the assortativity is negative, meaning that well-connected languages tend to link to poorly connected ones. That makes it look more like a technological network such as the internet than a social network. However, these statistics may be affected by the fact that the data is from an English perspective.
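For anyone wanting to see what two of these numbers measure, here is a minimal sketch of transitivity and degree assortativity for an undirected, unweighted graph with no repeated edges. The function names and the toy graphs are made up for illustration, and this isn't necessarily how the figures above were calculated.

```python
from collections import defaultdict
from itertools import combinations

def neighbours(edges):
    """Map each node to the set of nodes it is linked to."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def transitivity(edges):
    """Share of connected triples (a - centre - b) where a and b are also linked."""
    adj = neighbours(edges)
    triples = closed = 0
    for centre in adj:
        for a, b in combinations(adj[centre], 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0

def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each edge."""
    adj = neighbours(edges)
    xs, ys = [], []
    for u, v in edges:
        # Count each edge in both directions so the measure is symmetric.
        xs += [len(adj[u]), len(adj[v])]
        ys += [len(adj[v]), len(adj[u])]
    mean = sum(xs) / len(xs)  # xs and ys contain the same values
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    var = sum((x - mean) ** 2 for x in xs)
    return cov / var if var else float("nan")  # undefined if all degrees are equal

# Toy graphs (hypothetical): a hub with four spokes, and a single triangle.
star = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")]
triangle = [("x", "y"), ("y", "z"), ("x", "z")]

print(transitivity(star), transitivity(triangle))
print(degree_assortativity(star))
```

A hub-and-spoke graph is the extreme case of negative assortativity, since every edge joins the best-connected node to a node with a single link.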

I used Gephi to categorise the nodes. Here's the graph produced by Gephi (weak nodes trimmed; Modern English is 'Mod.Eng' in the top left):


Although I'd have to quantify this more carefully, it looks like the algorithm has classified the nodes so that the red ones are Germanic languages (Norse, German, Old High German etc.) and the green ones are Romance languages (Latin, Italian, French). The blue nodes look like an 'other' class, but also have some features in common: Egyptian, Sanskrit, Persian, Arabic. However, there are clear odd-ones-out: Celtic being grouped with Germanic, for example. Actually, the algorithm is non-deterministic, so this is just one of the categorisations it found. I'm going to have to go back and do this analysis properly and check the quality of the original data. Adding etymology data from another language would be good, too.
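Because the algorithm is non-deterministic, one simple way to check how stable a categorisation is would be to run it several times and compare the results pair by pair. Here is a minimal sketch using the Rand index, which is the share of node pairs that two categorisations treat the same way (both together or both apart). The two example runs below are invented for illustration and are not output from Gephi.

```python
from itertools import combinations

def rand_index(partition_a, partition_b):
    """Share of node pairs that both partitions treat the same way."""
    nodes = sorted(partition_a)
    agree = total = 0
    for u, v in combinations(nodes, 2):
        together_a = partition_a[u] == partition_a[v]
        together_b = partition_b[u] == partition_b[v]
        agree += together_a == together_b
        total += 1
    return agree / total

# Hypothetical runs: the class names differ, and Celtic moves between classes.
run_1 = {"Latin": "green", "French": "green", "Italian": "green",
         "Norse": "red", "German": "red", "Celtic": "red"}
run_2 = {"Latin": 1, "French": 1, "Italian": 1,
         "Norse": 2, "German": 2, "Celtic": 1}

print(rand_index(run_1, run_1))
print(rand_index(run_1, run_2))
```

The index only looks at which nodes end up together, so it doesn't matter that the two runs use different names for their classes.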

As a side note, I've also been learning how to do phylogenetic analyses using SplitsTree and Neighbour Net. I thought I'd do a little experiment: construct a graph of past and present languages, with the distance measure based on the number of words borrowed from one language to another.
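A borrowing count is really a similarity rather than a distance (more borrowing should mean closer), so some conversion is needed before a distance-based method can use it. Here is a minimal sketch of one possible conversion. The 1 / (1 + count) formula, the function name and the toy counts are all assumptions for illustration, not the conversion used for the diagram in this post.

```python
def distance_matrix(languages, borrowed):
    """Symmetric matrix in which more borrowing means a smaller distance."""
    size = len(languages)
    matrix = [[0.0] * size for _ in range(size)]
    for i, a in enumerate(languages):
        for j, b in enumerate(languages):
            if i != j:
                # Ignore direction: add up borrowings both ways.
                count = borrowed.get((a, b), 0) + borrowed.get((b, a), 0)
                matrix[i][j] = 1.0 / (1.0 + count)
    return matrix

# Toy counts (hypothetical), keyed by (borrower, source).
languages = ["Modern English", "Old French", "Latin"]
borrowed = {("Modern English", "Old French"): 3,
            ("Old French", "Latin"): 1}

matrix = distance_matrix(languages, borrowed)
for name, row in zip(languages, matrix):
    print(name, row)
```

Pairs with no recorded borrowing all get the maximum distance of 1, so a sparse dataset produces a matrix that is mostly ties.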

Below is the diagram that comes out:

Actually, I can't make any sense of this at all. I suspect that the data is too sparse. If I were to make a grand conclusion, it might be that some parts of the inheritance of English are tree-like and some are graph-like, and this might vary over time. This might affect the assumptions you can make about cultural inheritance. Still, there's some evidence of network-like structure there. And it looks pretty.

Aaron Clauset, Cristopher Moore, & M. E. J. Newman (2008). Hierarchical structure and the prediction of missing links in networks. Nature, 453, 98-101. arXiv:0811.0484v1