A recent paper in PLOS ONE by Caleb Everett looks at whether geography can affect phoneme inventories. Everett finds that language communities that live at higher altitudes are more likely to have ejective sounds in their phoneme inventories. He offers two hypotheses: that the lower air pressure at higher altitudes makes ejectives easier to produce, and that ejectives “may help to mitigate rates of water vapor loss through exhaled air” in the drier climates found at higher altitudes. I don’t have anything against this kind of theory in principle, and I’m not going to comment on its plausibility, but I wanted to check whether the stats held up.
This sounds suspiciously like one of our spurious correlations – links between cultural features that come about by accidents of cultural history rather than being causally related. Although Everett notes that the tests he uses include languages from many language families, there’s no real control for historical descent. James and I have also submitted a paper to PLOS ONE about this phenomenon more generally, and we suggest a few statistical tests that should be applied to this kind of claim. These include comparing the correlation of the variables of interest with similar variables that you don’t think are related, and controlling for historical descent by using, for example, phylogenetic generalised least squares. In this post, I apply these tests.
First, I test whether the link between ejectives and elevation is stronger than the link between elevation and many other linguistic features. I fitted a model for each variable in the WALS database, predicting elevation from that variable. Elevation (altitude) is indeed significantly associated with the presence of ejectives. Surprisingly, only two other variables were stronger predictors of elevation. That is, the presence of ejectives is in the top 1.4% of variables for predicting elevation. The presence of ejectives resulted in a correlation that was significantly stronger than 94.4% of variables (above 1.98 standard deviations). This is good news for Everett!
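The baseline comparison described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the analysis actually run for this post: the function names `anova_f` and `rank_against_baseline`, the simulated elevations, and the 50 random "features" are all assumptions made for the example. It scores each categorical feature by a one-way ANOVA F statistic for elevation and then counts how many baseline features beat the feature of interest.

```python
import numpy as np

def anova_f(values, groups):
    """One-way ANOVA F statistic for `values` split by the labels in `groups`."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    k, n = len(labels), len(values)
    if k < 2 or n <= k:
        return float("nan")
    grand_mean = values.mean()
    # Variation of the group means around the grand mean.
    ss_between = sum(
        (groups == g).sum() * (values[groups == g].mean() - grand_mean) ** 2
        for g in labels
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        ((values[groups == g] - values[groups == g].mean()) ** 2).sum()
        for g in labels
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def rank_against_baseline(elevation, features, target):
    """Score every feature, then list the ones that beat `target`."""
    scores = {name: anova_f(elevation, col) for name, col in features.items()}
    stronger = [
        name for name, f in scores.items()
        if name != target and f > scores[target]
    ]
    return scores, stronger

# Synthetic stand-in for the real data (hypothetical values throughout).
rng = np.random.default_rng(0)
n_langs = 400
ejectives = rng.integers(0, 2, n_langs)
elevation = 500 + 400 * ejectives + rng.normal(0, 600, n_langs)
features = {"ejectives": ejectives}
for i in range(50):
    # Unrelated three-valued features acting as the baseline distribution.
    features[f"feature_{i}"] = rng.integers(0, 3, n_langs)

scores, stronger = rank_against_baseline(elevation, features, "ejectives")
print(f"F for ejectives: {scores['ejectives']:.1f}")
print(f"{len(stronger)} of {len(features) - 1} baseline features score higher")
```

The point of the design is that the feature of interest is judged against the empirical distribution of scores from other features, rather than against a theoretical null alone.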
Below is a histogram of the results (F-score of the model fit), with a red line indicating the strength of the ejectives variable:
The linguistic variables that gave better results than ejectives were the Order of Object and Verb and the Relationship between the Order of Object and Verb and the Order of Adjective and Noun. I can’t think of a good reason that these would be linked. See below:
The next test involved controlling for common descent of languages. I built a phylogenetic tree from the linguistic classifications from the Ethnologue. We’re predicting elevation (continuous) given the presence of ejectives (discrete), so we’ll use a phylogenetic generalised least squares test (you can learn more about doing this in the excellent tutorials by Charles Nunn and others). This weights the observations by how related they are, given a particular model of trait evolution. The elevation variable has a strong phylogenetic signal (Pagel’s lambda = 0.3; significantly greater than 0, p<0.00001; significantly different from 1, p<0.00001), so we’ll use Pagel’s covariance matrix.
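To make the weighting concrete, here is a rough sketch of generalised least squares with a Pagel's lambda covariance matrix. It is not the code used for this post: the function names `pagel_lambda_cov` and `gls_fit`, the six-language toy data, and the fixed lambda of 0.3 are assumptions for illustration. In a real analysis lambda would normally be estimated by maximum likelihood alongside the regression coefficients, and the covariance matrix would come from the tree itself.

```python
import numpy as np

def pagel_lambda_cov(C, lam):
    """Pagel's lambda transform: scale the shared-history (off-diagonal)
    entries of the phylogenetic covariance matrix C by lam, keeping the
    diagonal unchanged."""
    V = lam * np.asarray(C, dtype=float)
    np.fill_diagonal(V, np.diag(C))
    return V

def gls_fit(X, y, V):
    """Generalised least squares: beta = (X' V^-1 X)^-1 X' V^-1 y.
    Returns coefficients, standard errors and t values."""
    Vinv = np.linalg.inv(V)
    XtVinv = X.T @ Vinv
    cov_unscaled = np.linalg.inv(XtVinv @ X)
    beta = cov_unscaled @ XtVinv @ y
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = (resid @ Vinv @ resid) / dof
    se = np.sqrt(sigma2 * np.diag(cov_unscaled))
    return beta, se, beta / se

# Toy data (hypothetical): two families of three languages each.
# Languages in the same family share 90% of their history.
C = np.kron(np.eye(2), np.full((3, 3), 0.9))
np.fill_diagonal(C, 1.0)

ejectives = np.array([1, 1, 0, 0, 0, 1], dtype=float)
elevation = np.array([2400.0, 2100.0, 1900.0, 300.0, 150.0, 900.0])
X = np.column_stack([np.ones(len(elevation)), ejectives])  # intercept + predictor

beta, se, t = gls_fit(X, elevation, pagel_lambda_cov(C, 0.3))
print("coefficients:", beta)
print("t values:", t)
```

With lambda = 0 the covariance matrix is diagonal and the fit reduces to ordinary least squares; with lambda = 1 the full tree structure is used. Values in between discount shared history proportionally.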
Surprisingly, the correlation holds up, even when controlling for phylogeny (491 languages, df = 419, residual df = 489, estimated lambda = 0.2787271, coef = 358.9542, t = 3.51, p = 0.0005). Edit: If you use ejectives as the dependent variable, the result is similar (estimated lambda = 0.8169142, coef = 0.00003975, t = 2.42, p = 0.0157).
I’d like to make two points. First, this kind of analysis is easy to do, and makes the test more rigorous (I did the above analyses at Singapore airport). Second, while the stats might hold up, this kind of approach can only point towards future research, rather than supplying definitive proof of the hypothesis. It’s an interesting proposal, and I look forward to some modelling or experimental evidence.
The phylogenetic tree assumed languages within families evolved over 6,000 years and there was a common ancestor for all language families 60,000 years ago. You can see a diagram of the tree here, with WALS codes.
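As a rough illustration of how those time depths turn into a covariance matrix, the sketch below builds the shared-history matrix for a handful of languages. It makes a strong simplifying assumption that the real tree does not: each family is flattened into a star, so the Ethnologue subgroupings inside a family are ignored. The function name `shared_history_matrix` and the family labels are hypothetical.

```python
import numpy as np

FAMILY_DEPTH = 6_000   # years of evolution within each family
ROOT_DEPTH = 60_000    # years back to the common ancestor of all families

def shared_history_matrix(families):
    """Proportion of history shared by each pair of languages, assuming a
    star-shaped tree within each family (a simplification)."""
    fam = np.asarray(families)
    same_family = fam[:, None] == fam[None, :]
    # Same family: lineages split FAMILY_DEPTH years ago, so they share
    # everything before that. Different families: they split at the root.
    C = np.where(same_family, ROOT_DEPTH - FAMILY_DEPTH, 0.0).astype(float)
    np.fill_diagonal(C, ROOT_DEPTH)
    return C / ROOT_DEPTH

C = shared_history_matrix(["Family A", "Family A", "Family B"])
print(C)
```

Under these assumptions two languages from the same family share 90% of their history and languages from different families share none, which is the kind of structure the phylogenetic regression uses to down-weight closely related observations.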
The altitude data I used comes from the 90-meter NASA database (SRTM3), extracted using GPS Visualizer, while Everett uses surveys by Google Earth and ArcGIS v. 10.0. I checked some points and there are very slight differences, on the order of a few meters.