statistics – Replicated Typo

Higgs Boson and Big Data

It’s not about cultural evolution, but I think most people who have even a passing interest in science are gearing up to welcome the Higgs Boson to the elementary particle party. Anyway, here’s a nicely put-together video explaining what the Higgs Boson is and why its discovery is significant:

The Higgs Boson Explained from PHD Comics on Vimeo.

There’s also a more general point about needing to gather a huge amount of data (15 petabytes a year — enough to fill more than 1.7 million dual-layer DVDs a year) to find the very small effect size that is predicted for the Higgs Boson. In itself, data of this magnitude will likely come with significantly more noise, which means physicists have needed to develop well-defined statistical methods (they even have their own statistics committee). It really is a massive achievement for modern science.
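The DVD comparison above is easy to check with a couple of lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal (SI) petabytes and a nominal 8.5 GB capacity for a dual-layer DVD, neither of which is stated in the post, and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the "1.7 million DVDs" figure.
# Assumptions (not from the post): SI petabytes and a nominal
# dual-layer DVD capacity of 8.5 GB.
PETABYTE = 10**15        # bytes
DUAL_LAYER_DVD = 8.5e9   # bytes


def dvds_needed(petabytes: float) -> float:
    """Number of dual-layer DVDs needed to hold the given data volume."""
    return petabytes * PETABYTE / DUAL_LAYER_DVD


if __name__ == "__main__":
    per_year = dvds_needed(15)
    print(f"{per_year / 1e6:.2f} million DVDs per year")
```

Under those assumptions the result lands a little above 1.7 million, which agrees with the figure quoted.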

The power of diversity: New Scientist recognises the growing work on social structure and linguistic structure

A feature article in last week’s New Scientist asks why there is so much linguistic diversity present in the world, and what are the forces that drive it. The article reads like a who’s who of the growing field of language structure and social structure: Mark Pagel, Gary Lupyan, Quentin Atkinson, Robert Munroe, Carol and Melvin Ember, Dan Dediu and Robert Ladd, Stephen Levinson (click on the names to see some Replicated Typo articles about their work). This is practically as close as my subject will come to having a pull-out section in Vanity Fair. Furthermore, it recognises the weakening grip of Chomskyan linguistics.

Commentators have already gotten hung up on whether English became simplified before or after spreading, but this misses the impact of the article: there is an alternative approach to linguistics which looks at the differences between languages and recognises social factors as the primary source of linguistic change. Furthermore, these ideas are testable using statistics and genetic methods. It’s a pity the article didn’t mention the possibility of experimental approaches, including Gareth Roberts’ work on emerging linguistic diversity and work on cultural transmission using the Pictionary paradigm (Simon Garrod, Nick Fay, Bruno Galantucci, see here and here).

David Robson (2011). Power of Babel: Why one language isn’t enough. New Scientist, 2842. Online.

Linguistic diversity and traffic accidents

I was thinking about Daniel Nettle’s model of linguistic diversity which showed that linguistic variation tends to decline even with a small amount of migration between communities. I wondered if statistics about population movement would correlate with linguistic diversity, as measured by the Greenberg Diversity Index (GDI) for a country (see below). However, this is a cautionary tale about obsession and use of statistics. (See bottom of post for link to data).
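For readers who haven’t met it, Greenberg’s diversity index is the probability that two people picked at random from a country have different mother tongues, i.e. one minus the sum of the squared speaker proportions. Here is a minimal sketch of that calculation; the function name and the example counts are invented for illustration and are not the data used in the post.

```python
def greenberg_diversity(speaker_counts):
    """Greenberg's diversity index: 1 - sum(p_i ** 2).

    speaker_counts holds the number of first-language speakers of each
    language in a country. Returns 0.0 when everyone shares one language
    and approaches 1.0 as speakers are spread over many languages.
    """
    total = sum(speaker_counts)
    if total <= 0:
        raise ValueError("need at least one speaker")
    return 1.0 - sum((n / total) ** 2 for n in speaker_counts)


if __name__ == "__main__":
    # Invented counts, purely to show the range of the index.
    print(greenberg_diversity([1000]))            # one language only
    print(greenberg_diversity([500, 500]))        # two equal languages
    print(greenberg_diversity([250, 250, 250, 250]))
```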

Continue reading “Linguistic diversity and traffic accidents”

A random walk model of linguistic complexity

EDIT: Since writing this post, I have discovered a major flaw with the conclusion which is described here.

One of the problems with large-scale statistical analyses of linguistic typologies is the temporal resolution of the data. Because we typically have only a single measurement for each population, we can’t see the dynamics of the system. A correlation between two variables that exists now may be an accident of more complex dynamics. For instance, Lupyan & Dale (2010) find a statistically significant correlation between a linguistic population’s size and its morphological complexity. One hypothesis is that the languages of larger populations adapt to adult learners as they come into contact with other languages. Hay & Bauer (2007) also link demography with phonemic diversity. However, it’s not clear how robust these relationships are over time, because of a lack of data on these variables in the past.

To test this, a benchmark is needed. One method is to use careful statistical controls, such as controlling for the area that the language is spoken in, the density of the population etc. However, these data also tend to be synchronic. Another method is to compare the results against the predictions of a simple model. Here, I propose a simple model based on a dynamic where cultural variants in small populations change more rapidly than those in large populations. This models the stochastic nature of small samples (see the introduction of Atkinson, 2011 for a brief review of this idea). This model tests whether chaotic dynamics lead to periods of apparent correlation between variables. Source code for this model is available at the bottom.
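To give a flavour of the idea, here is a rough sketch of a random walk of this kind. It is not the source code from the full post: the step sizes, the bounds on population size, the use of log population size, and all function names are assumptions made for illustration. Each population's complexity takes a random step whose spread shrinks as the population grows, and we record the cross-population correlation between size and complexity at every time step.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def simulate(n_pops=50, n_steps=500, seed=1):
    """Return the size/complexity correlation at each time step."""
    rng = random.Random(seed)
    # log10 of population size, kept between 100 and 1,000,000 speakers (assumed bounds)
    log_size = [rng.uniform(2.0, 6.0) for _ in range(n_pops)]
    complexity = [0.0] * n_pops
    correlations = []
    for _ in range(n_steps):
        for i in range(n_pops):
            # Population size itself drifts slowly.
            log_size[i] = min(6.0, max(2.0, log_size[i] + rng.gauss(0, 0.05)))
            # Smaller populations take bigger random steps in complexity.
            complexity[i] += rng.gauss(0, 1.0 / log_size[i])
        correlations.append(pearson(log_size, complexity))
    return correlations


def share_above(correlations, threshold=0.28):
    """Fraction of time steps where |r| exceeds the threshold."""
    return sum(abs(r) > threshold for r in correlations) / len(correlations)


if __name__ == "__main__":
    rs = simulate()
    print(f"share of steps with |r| > 0.28: {share_above(rs):.2f}")
```

The default threshold of 0.28 is roughly the two-tailed 5% critical value for a correlation over 50 points; the interesting question is how often an undirected walk like this strays past it.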

Continue reading “A random walk model of linguistic complexity”

Categorising languages through network modularity

Today I’ve been learning more about network structure (from Cris Moore) and I’ve applied my poor understanding and overconfidence to find language families from etymology data!

Here’s what I understand so far (see Clauset, Moore, & Newman, 2008): The modularity of a network is a measure of how many ‘communities’ it has. An optimal modularity will split the graph to maximise the average degree within modules or clusters. You can search all the possible clusterings to find this optimum. I’m still hazy on how this is actually done. You can also extend this to find hierarchies, as in phylogenetics, but without some of the assumptions. Luckily, there’s a network analysis program called Gephi that does this automatically!
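To make the score being optimised a bit more concrete, here is a small sketch that computes modularity for one given split of an undirected graph. It uses the standard Newman-Girvan definition (the fraction of edges inside each community minus the fraction expected if edges were wired at random with the same degrees), which is my assumption about what is meant here; it does not do the search for the best split, and it is not how Gephi is implemented. The toy graph and names are invented.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman-Girvan modularity Q of a partition of an undirected graph.

    edges:     list of (u, v) pairs, each undirected edge listed once
    community: dict mapping every node to a community label
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    # Q = sum over communities of (internal / m) - (degree_sum / 2m) ** 2
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


if __name__ == "__main__":
    # Toy graph: two triangles joined by a single edge.
    edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
    two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
    one_group = {node: 0 for node in two_groups}
    print(modularity(edges, two_groups))
    print(modularity(edges, one_group))
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is the behaviour a modularity search relies on.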

Continue reading “Categorising languages through network modularity”

Academic Networking

Who are the movers and shakers in your field? You can use social network theory on your bibliographies to find out:

Today I learned about some studies looking at social networks constructed from bibliographic data (from Mark Newman, see Newman 2001 or Said et al. 2008). Nodes on a graph represent authors and edges are added if those authors have co-authored a paper.

I scripted a little tool to construct such a graph from bibtex files – the bibliographic data files used with latex. The Language Evolution and Computation Bibliography – a list of the most relevant papers in the field – is available in bibtex format.
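As a rough idea of what such a tool has to do, here is a minimal sketch. It is not the script behind the application: the regular expression only handles author fields that sit on a single line, the function names are made up, and the two BibTeX entries are invented placeholders rather than real papers.

```python
import re
from collections import Counter
from itertools import combinations

# Matches a single-line field such as:  author = {Alpha, A. and Beta, B.},
# (an assumption; real BibTeX allows multi-line and nested-brace fields)
AUTHOR_FIELD = re.compile(r'^\s*author\s*=\s*[{"](.*)[}"]\s*,?\s*$',
                          re.IGNORECASE | re.MULTILINE)


def coauthor_edges(bibtex_text):
    """Count how many papers each pair of authors has written together."""
    edges = Counter()
    for match in AUTHOR_FIELD.finditer(bibtex_text):
        names = {n.strip() for n in re.split(r"\s+and\s+", match.group(1))}
        authors = sorted(n for n in names if n)
        for pair in combinations(authors, 2):
            edges[pair] += 1
    return edges


def collaborator_counts(edges):
    """Number of distinct co-authors per author (the node's degree)."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts


if __name__ == "__main__":
    sample = """
@article{one,
  author = {Alpha, A. and Beta, B. and Gamma, C.},
  title = {First made-up paper},
}
@article{two,
  author = {Beta, B. and Alpha, A.},
  title = {Second made-up paper},
}
"""
    edges = coauthor_edges(sample)
    for author, degree in collaborator_counts(edges).most_common():
        print(author, degree)
```

Ranking authors by their number of distinct collaborators is only one crude way of spotting the movers and shakers; other centrality measures could be swapped in once the edge list exists.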

You can look at the program using the online Academic Networking application that I scripted today, or upload your own bibtex file to find out who the movers and shakers are in your field. Soon, I hope to add an automatic graph-visualisation, too.

Continue reading “Academic Networking”

Statistics and Symbols in Mimicking the Mind

MIT recently held a symposium on the current status of AI, which apparently has seen precious little progress in recent decades. The discussion, it seems, ground down to a squabble over the prevalence of statistical techniques in AI and a call for a revival of work on the sorts of rule-governed models of symbolic processing that once dominated much of AI and its sibling, computational linguistics.

Briefly, from the early days in the 1950s up through the 1970s both disciplines used models built on carefully hand-crafted symbolic knowledge. The computational linguists built parsers and sentence generators and the AI folks modeled specific domains of knowledge (e.g. diagnosis in selected medical domains, naval ships, toy blocks). Initially these efforts worked like gang-busters. Not that they did much by Star Trek standards, but they actually did something and they did things never before done with computers. That’s exciting, and fun.

In time, alas, the excitement wore off and there was no more fun. Just systems that got too big and failed too often and they still didn’t do a whole heck of a lot.

Then, starting, I believe, in the 1980s, statistical models were developed that, yes, worked like gang-busters. And these models actually did practical tasks, like speech recognition and then machine translation. That was a blow to the symbolic methodology because these programs were “dumb.” They had no knowledge crafted into them, no rules of grammar, no semantics. Just routines that learned while gobbling up terabytes of example data. Thus, as Google’s Peter Norvig points out, machine translation is now dominated by statistical methods. No grammars and parsers carefully hand-crafted by linguists. No linguists needed.

What a bummer. For machine translation is THE prototype problem for computational linguistics. It’s the problem that set the field in motion and has been a constant arena for research and practical development. That’s where much of the handcrafted art was first tried, tested, and, in a measure, proved. For it to now be dominated by statistics . . . bummer.

So that’s where we are. And that’s what the symposium was chewing over.

Continue reading “Statistics and Symbols in Mimicking the Mind”

The end of universals?

Woah, I just read some of the responses to Dunn et al. (2011) “Evolved structure of language shows lineage-specific trends in word-order universals” (Language Log here, Replicated Typo coverage here). It’s come in for a lot of flak. One concern raised at the LEC was that, considering an extreme interpretation, there may be no effect of universal biases on language structure. This goes against Generativist approaches, but also the Evolutionary approach adopted by LEC-types. For instance, Kirby, Dowman & Griffiths (2007) suggest that there are weak universal biases which are amplified by culture. But there should be some trace of universality nonetheless.

Below is the relationship diagram for Indo-European and Uto-Aztecan feature dependencies from Dunn et al. Bolder lines indicate stronger dependencies. They appear to have different dependencies: only one is shared (Genitive-Noun and Object-Verb).

However, I looked at the median Bayes Factors for each of the possible dependencies (available in the supplementary materials). These are the raw numbers that the above diagrams are based on. If the dependencies’ strengths rank in roughly the same order in two families, they will have a high Spearman rank correlation.

Spearman Rank Correlation	Indo-European	Austronesian
Uto-Aztecan	0.39, p = 0.04	0.25, p = 0.19
Indo-European		-0.13, p = 0.49

Spearman rank correlation coefficients and p-values for Bayes Factors for different dependency pairs in different language families. Bantu was excluded because of missing feature data.
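For anyone who wants to see what the coefficient in the table measures, here is a minimal pure-Python sketch of a Spearman rank correlation: rank each list (ties get the average of their ranks), then take the ordinary Pearson correlation of the ranks. It does not compute p-values, the function names are my own, and the numbers in the example are invented placeholders, not Bayes Factors from Dunn et al.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Invented dependency strengths for the same feature pairs in two families.
    family_a = [0.5, 1.2, 3.0, 8.0, 2.1]
    family_b = [0.1, 0.9, 0.4, 20.0, 1.5]
    print(round(spearman(family_a, family_b), 2))
```

Because only the ordering matters, two families can score highly even when their strongest dependencies differ a lot in absolute size.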

Although the Indo-European and Uto-Aztecan families have different strong dependencies, they have similar rankings of those dependencies. That is, two features with a weak dependency in an Indo-European language tend to have a weak dependency in a Uto-Aztecan language, and the same is true of strong dependencies. The same is true to some degree for Uto-Aztecan and Austronesian languages. This might suggest that there are, in fact, universal weak biases lurking beneath the surface. Lucky for us.

However, this does not hold between Indo-European and Austronesian language families. Actually, I have no idea whether a simple correlation between Bayes Factors makes any sense after hundreds of computer hours of advanced phylogenetic statistics, but the differences may be less striking than the diagram suggests.

UPDATE:

As Simon Greenhill points out below, the statistics are not at all conclusive. However, I’m adding the graphs for all Bayes Factors (these are made directly from the Bayes Factors in the Supplementary Material):

[Graphs of all Bayes Factors, one panel per family: Austronesian, Bantu, Indo-European, Uto-Aztecan]

Michael Dunn, Simon J. Greenhill, Stephen C. Levinson, & Russell D. Gray (2011). Evolved structure of language shows lineage-specific trends in word-order universals. Nature, 473, 79-82

Mapping Linguistic Phylogeny to Politics

In a recent article covered in NatureNews in Societies Evolve in Steps, Tom Currie of UCL, and others, like Russell Gray of Auckland, use quantitative analysis of the Polynesian language group to plot socioanthropological movement and power hierarchies in Polynesia. This is based on previous work, available here, which I saw presented at the Language as an Evolutionary System conference last July. The article claims that the means of change for political complexity can be determined using linguistic evidence in Polynesia, along with various migration theories and archaeological evidence.

I have my doubts.

Note: Most of the content in this post is refuted wonderfully in the comment section by one of the original authors of the paper. I highly recommend reading the comments, if you’re going to read this at all – that’s where the real meat lies. I’m keeping this post up, finally, because it’s good to make mistakes and learn from them. -Richard

§§

I had posted this already on the Edinburgh Language Society blog. I’ve edited it a bit for this blog. I should also state that this is my inaugural post on Replicated Typo; thanks to Wintz’ invitation, I’ll be posting here every now and again. It’s good to be here. Thanks for reading – and thanks for pointing out errors, problems, corrections, and commenting, if you do. Research blogging is relatively new to me, and I relish this unexpected chance to hone my skills and learn from my mistakes. (Who am I, anyway?) But without further ado:

I have my doubts. The talk that was given by Russell Gray suggested that there were still various theories about the migratory patterns of the Polynesians – in particular, where they started from. What his work did was to use massive supercomputers to narrow down all of the possibilities, by using lexicons and charting their similarities. The most probable were then recorded, and their statistical probability indicated what was probably the course of action. This, however, is where the ability for guessing ends. Remember, this is massive quantificational statistics. If one has a 70% probability of one language being the root of another, that isn’t to say that that language is the root, much less that the organisation of one determines the organisation of another. But statistics are normally unassailable – I only bring up this disclaimer because there isn’t always a clear mapping between language usage and migration.

Continue reading “Mapping Linguistic Phylogeny to Politics”

Experiments in Communication pt 1: Artificial Language Learning and Constructed Communication Systems

Much recent research in linguistics has used experiments to test hypotheses directly, comparing real-world data with laboratory results and computer simulations. In a previous post I looked at how humans, non-human primates, and even non-human animals are all capable of high-fidelity cultural transmission. Yet, to apply this framework to human language, another set of experimental literature needs to be considered, namely: artificial language learning and constructed communication systems.

Continue reading “Experiments in Communication pt 1: Artificial Language Learning and Constructed Communication Systems”