Tag Archives: lexicon

Cultural differences in lateral transmission: Phylogenetic trees are OK for Linguistics but not biology

The three areas under analysis

An article in PLoS ONE debunks the myth that hunter-gatherer societies borrow more words than agriculturalist societies. In doing so, it suggests that horizontal transmission is low enough for phylogenetic analyses to be a valid linguistic tool.

Lexicons from around 20% of the extant languages spoken by hunter-gatherer societies were coded for etymology (available in the supplementary material). The levels of borrowed words were compared with the languages of agriculturalist and urban societies taken from the World Loanword Database.  The study focussed on three locations:  Northern Australia, northwest Amazonia, and California and the Great Basin.

In opposition to some previous hypotheses, hunter-gatherer societies did not borrow significantly more words than agricultural societies in any of the regions studied.

The rates of borrowing were universally low, with most languages borrowing no more than 10% of their basic vocabulary.  The mean rate for hunter-gatherer societies was 6.38%, while the mean for agriculturalist societies was 5.15%.  This difference is actually significant overall, but not within particular regions.  Therefore, the authors claim, “individual area variation is more important than any general tendencies of HG or AG languages”.
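The overall-but-not-regional significance pattern can be illustrated with a simple permutation test on the difference of group means. The rates below are invented for the sketch; only the reported means (6.38% and 5.15%) come from the study:

```python
import random

random.seed(0)

# Hypothetical borrowing rates (% of basic vocabulary borrowed) for two
# groups of languages -- illustrative numbers only, not the study's data.
hg_rates = [6.4, 2.1, 9.8, 5.0, 7.7, 3.2, 8.9, 6.0]   # hunter-gatherer
ag_rates = [5.2, 4.0, 6.1, 3.8, 7.0, 4.4, 5.5, 5.0]   # agriculturalist

def perm_test(a, b, n=10000):
    """Two-sided permutation test on the difference of group means."""
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n):
        random.shuffle(pooled)
        x, y = pooled[:len(a)], pooled[len(a):]
        if abs(sum(x) / len(x) - sum(y) / len(y)) >= observed:
            hits += 1
    return hits / n  # proportion of shuffles at least as extreme

diff = sum(hg_rates) / len(hg_rates) - sum(ag_rates) / len(ag_rates)
print(f"difference in means: {diff:.2f} percentage points, p = {perm_test(hg_rates, ag_rates):.3f}")
```

With small per-region samples like these, a difference can reach significance in the pooled data while failing to do so within any single region, which is the pattern the authors report.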

Interestingly, in some regions, mobility, population size and population density were significant factors.  Mobile populations and low-density populations had significantly lower borrowing rates, while smaller populations borrowed proportionately more words.  This may be in line with the theory of linguistic carrying capacity as discussed by Wintz (see here and here).  The level of exogamy was a significant factor in Australia.

The study concludes that phylogenetic analyses are a valid form of linguistic analysis because the level of horizontal transmission is low.  That is, languages are tree-like enough for phylogenetic assumptions to be valid:

“While it is important to identify the occasional aberrant cases of high borrowing, our results support the idea that lexical evolution is largely tree-like, and justify the continued application of quantitative phylogenetic methods to examine linguistic evolution at the level of the lexicon. As is the case with biological evolution, it will be important to test the fit of trees produced by these methods to the data used to reconstruct them. However, one advantage linguists have over biologists is that they can use the methods we have described to identify borrowed lexical items and remove them from the dataset. For this reason, it has been proposed that, in cases of short to medium time depth (e.g., hundreds to several thousand years), linguistic data are superior to genetic data for reconstructing human prehistory “
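The advantage the quote describes, identifying borrowed lexical items and dropping them before tree-building, amounts to a simple filter over an etymologically coded lexicon. A toy illustration (the data format, codes, and forms here are all invented):

```python
# Toy illustration of pruning borrowed items before phylogenetic analysis.
# Each meaning maps languages to a (form, code) pair; the code marks the
# item as inherited or borrowed.
lexicon = {
    "hand":  {"LangA": ("mara", "inherited"), "LangB": ("mara", "inherited"),
              "LangC": ("tiki", "borrowed")},
    "water": {"LangA": ("gugu", "inherited"), "LangB": ("wala", "borrowed"),
              "LangC": ("gugu", "inherited")},
}

def strip_borrowings(lex):
    """Keep only inherited items, so tree-building sees vertical signal."""
    return {meaning: {lang: form
                      for lang, (form, code) in entries.items()
                      if code == "inherited"}
            for meaning, entries in lex.items()}

pruned = strip_borrowings(lexicon)
print(pruned["hand"])   # LangC's borrowed form is gone
```

The biological analogue (identifying and removing horizontally transferred genes) is far harder, which is the advantage the authors claim for linguistics.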

Excellent – linguistics beats biology for a change!

However, while the level of horizontal transmission might not be a problem in this analysis, there may be a problem in the paths of borrowing.  If a language borrows relatively few words, but those words come from many different languages and may have taken many paths through previous generations, there may be a subtle effect of horizontal transmission that is being masked.  The authors acknowledge that they did not address the direction of transmission in a quantitative way.

A while ago, I did a study of English etymology, trying to quantify the level of horizontal transmission through time (description here).  The graph for English doesn’t look tree-like to me; perhaps the dynamics of borrowing work differently for languages with a high level of contact:

Claire Bowern, Patience Epps, Russell Gray, Jane Hill, Keith Hunley, Patrick McConvell, & Jason Zentz (2011). Does Lateral Transmission Obscure Inheritance in Hunter-Gatherer Languages? PLoS ONE, 6(9). doi:10.1371/journal.pone.0025195

Sonority and Sex: Why smaller communities are louder

Through this post on Sprogmuseet about Atkinson’s analysis of the out-of-Africa hypothesis, I found an article by Ember & Ember (2007) (who also quantified the link between colour lexicon size and distance from the equator; see my post here) on sonority and climate.  The article extends work by Fought et al. (2004), which finds that a language’s sonority is related to climate.  Sonority is a measure of amplitude (loudness) and is greater for vowels than for consonants (for example, see here).  Basically, the warmer the climate, the greater the sonority of the population’s phoneme inventory.  The theory is that “people in warmer climates generally spend more time outdoors and communicate at a distance more often than people in colder climates”.
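As a toy illustration, the sonority of a phoneme inventory can be scored against a coarse sonority scale. The numeric values below are illustrative, roughly following the standard hierarchy (vowels > glides > liquids > nasals > fricatives > stops), not the scale used by Fought et al.:

```python
# Coarse, illustrative sonority scale: higher = more sonorous.
SONORITY = {"a": 7, "e": 7, "o": 7, "i": 6, "u": 6,   # vowels
            "w": 5, "j": 5,                            # glides
            "l": 4, "r": 4,                            # liquids
            "m": 3, "n": 3,                            # nasals
            "s": 2, "f": 2,                            # fricatives
            "p": 1, "t": 1, "k": 1, "b": 1, "d": 1, "g": 1}  # stops

def mean_sonority(inventory):
    """Average sonority of a phoneme inventory (string of symbols)."""
    return sum(SONORITY[p] for p in inventory) / len(inventory)

# The climate hypothesis predicts higher scores for warmer-climate languages.
print(mean_sonority("aeioumn"))   # vowel-heavy inventory
print(mean_sonority("ptksfmn"))   # consonant-heavy inventory
```

On a scale like this, a vowel-heavy inventory scores several points higher than an obstruent-heavy one, which is the kind of difference the climate correlation is tracking.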

Continue reading

The Genesis of Grammar

In my previous post on linguistic replicators and major transitions, I mentioned grammaticalisation as a process that might inform us about the contentive-functional split in the lexicon. Naturally, it makes sense that grammaticalisation might offer insights into other transitions in linguistics, and, thanks to an informative comment from a regular reader, I was directed to a book chapter by Heine & Kuteva (2007): The Genesis of Grammar: On combining nouns. I might dedicate a post to the paper in the future, but, as with many previous claims, this probably won’t happen. So instead, here is the abstract and a table of the authors’ hypothesised grammatical innovations:

That it is possible to propose a reconstruction of how grammar evolved in human languages is argued for by Heine and Kuteva (2007). Using observations made within the framework of grammaticalization theory, these authors hypothesize that time-stable entities denoting concrete referential concepts, commonly referred to as ‘nouns’, must have been among the first items distinguished by early humans in linguistic discourse. Based on crosslinguistic findings on grammatical change, this chapter presents a scenario of how nouns may have contributed to introducing linguistic complexity in language evolution.

Evolving Linguistic Replicators: Major Transitions and Grammaticalisation

Just before Christmas I found myself in the pub speaking to Sean about his work on bilingualism, major transitions and the contrast between language change and the cultural evolution of language. Now, other than revealing that our social time is spent discussing our university work, the conversation brought up a distinction not often made: whilst language change is part of language evolution, the latter is also what we consider to be a major transition. As you evolutionary biologists will know, this concept is perhaps best examined and almost certainly popularised in Maynard Smith & Szathmáry’s (1995) The Major Transitions in Evolution. Here, the authors are not promoting the fallacy of guided evolution, where the inevitable consequence is increased and universal complexity. Their thesis is more subtle: that some lineages become more complex over time, with this increase being attributable to the way in which genetic information is transmitted between generations. In particular, they note eight transitions in the evolution of life:

What’s notable about these transitions, and why they aren’t necessarily an arbitrary list, is that all of them share some broad commonalities, namely:

Continue reading

Dialects in Tweets

A recent study published in the proceedings of the Empirical Methods in Natural Language Processing conference (EMNLP) in October, and presented at the LSA conference last week, found evidence of geographical lexical variation in Twitter posts. (For news stories on it, see here and here.) Eisenstein, O’Connor, Smith and Xing took a batch of Twitter posts from a released corpus of 15% of all posts during a week in March. In total, they kept 4.7 million tokens from 380,000 messages by 9,500 users, all geotagged from within the continental US. They cut out messages from over-active users, taking only messages from users with fewer than a thousand followers and followees (however, the average author still published around 40 posts per day, which might be seen by some as excessive). They also only took messages from iPhones and BlackBerries, which have the geotagging function. Eventually, they ended up with just over 5,000 words, of which a quarter did not appear in the spell-checking lexicon aspell.

The Generative Model

In order to figure out lexical variation accurately, both topic and geographical region had to be ascertained. To do this, they used a generative model (seen above) that jointly infers both. Generative models work on the assumption that text is the output of a stochastic process that can be analysed statistically. By looking at massive amounts of text, they were able to infer the topics being talked about. Basically, I could be thinking of a few topics – dinner, food, eating out. If I am in SF, it is likely that I may end up using the word taco in my tweet, based on those topics. What the model does is take those topics and figure out from them which words are chosen, while at the same time factoring in the spatial region of the author. This way, lexical variation is easier to place accurately, whereas before discourse topic would have significantly skewed the results (the median error drops from 650 to 500 km, which isn’t that bad, all in all).

The way it works (in summary, and quoting the slide show presented at the LSA annual meeting, since I’m not entirely sure of the details) is that, in order to add a topic, several things must be done. For each author, the model a) picks a region from P(r | ∂), b) picks a location from P(y | lambda, v), and c) picks a distribution over topics from P(Theta | alpha). For each token, it must a) pick a topic from P(z | Theta), and then b) pick a word from P(w | nu). Or something like that (sorry). For more, feel free to download the paper on Eisenstein’s website.
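The generative story above can be sketched as forward sampling. This is a toy paraphrase of the slides, not the paper's actual model: the regions, coordinates, topics, and word lists below are all invented, and the learned distributions are replaced by hand-set ones.

```python
import random

random.seed(1)

# Invented stand-ins for the model's learned distributions.
REGIONS = {"west": (37.8, -122.4), "east": (40.7, -74.0)}   # region -> centre (lat, lon)
TOPIC_WORDS = {                                              # word choice depends on topic AND region
    "food":  {"west": ["taco", "burrito"], "east": ["pizza", "bagel"]},
    "greet": {"west": ["hella", "hey"],    "east": ["yo", "deadass"]},
}

def generate_author(n_tokens=5):
    # For each author: pick a region, a location, and a topic distribution.
    region = random.choice(list(REGIONS))
    lat, lon = REGIONS[region]                 # location (here: just the region centre)
    topic_probs = {"food": 0.5, "greet": 0.5}  # author's distribution over topics
    # For each token: pick a topic, then a word given topic and region.
    tokens = []
    for _ in range(n_tokens):
        topic = random.choices(list(topic_probs),
                               weights=list(topic_probs.values()))[0]
        tokens.append(random.choice(TOPIC_WORDS[topic][region]))
    return region, (lat, lon), tokens

print(generate_author())
```

Inference in the real model runs this story in reverse: given only the words and geotags, it recovers the latent regions and topics jointly, which is what lets it separate topical from geographical variation.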

Well, what did they find? Basically, Twitter posts do show massive variation based on region. There are geographically-specific proper names, of course, and topics of local prominence, like taco in LA and cab in NY. There’s also variation in foreign-language words, with pues in LA but papi in SF. More interestingly, however, there is a major difference in regional slang. ‘uu’, for instance, is pretty much exclusive to the Eastern seaboard, while ‘you’ is stretched across the nation (with ‘yu’ being only slightly smaller). ‘suttin’ for something is used only in NY, as is ‘deadass’ (meaning very) and, on an even smaller scale, ‘odee’, while ‘af’ is used for very in the Southwest, and ‘hella’ is used in most of the Western states.

Dialectal variation for 'very'

More importantly, though, the study shows that with this model we can separate geographical from topical variation, as well as discover geographical variation from text itself instead of relying solely on geotagging. The authors hope that future work will cover differences between spoken variation and variation in digital media. And I, for one, think that’s #deadass cool.

Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, & Eric P. Xing (2010). A Latent Variable Model for Geographic Lexical Variation. Proceedings of EMNLP

Mutual Exclusivity in the Naming Game

The Categorisation Game or Naming Game looks at how agents in a population converge on a shared system for referring to continuous stimuli (Steels, 2005; Nowak & Krakauer, 1999). Agents play games with each other, one referring to an object with a word and the other trying to guess what object the first agent was referring to. Through experience with the world and feedback from other agents, agents update their words. Eventually, agents are able to communicate effectively.  The model is usually couched in terms of agents trying to agree on labels for colours (a continuous meaning space).  In this post I’ll show that the algorithms used have implicit mutual exclusivity biases, which favour monolingual viewpoints.  I’ll also show that this bias is not necessary and obscures some interesting insights into the evolutionary dynamics of language.
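A minimal naming game can be sketched as follows, with a handful of discrete objects standing in for the continuous colour space. The overwrite step in `play_round` is the kind of implicit mutual exclusivity bias at issue: an agent never keeps two words for one object.

```python
import random

random.seed(42)

OBJECTS = ["obj0", "obj1", "obj2"]

def new_agent():
    # Each agent starts with a random private word for every object.
    return {obj: f"w{random.randrange(1000)}" for obj in OBJECTS}

def play_round(agents):
    # Speaker names an object; on mismatch the hearer adopts the
    # speaker's word, discarding its own (one word per object, always).
    speaker, hearer = random.sample(agents, 2)
    obj = random.choice(OBJECTS)
    if hearer[obj] != speaker[obj]:
        hearer[obj] = speaker[obj]

agents = [new_agent() for _ in range(10)]
for _ in range(2000):
    play_round(agents)

# After many rounds the population typically converges on shared names.
shared = sum(len({a[obj] for a in agents}) == 1 for obj in OBJECTS)
print(f"{shared}/{len(OBJECTS)} objects have a single shared name")
```

Letting agents hold several weighted words per object instead, and only strengthening or weakening them on feedback, removes the bias; that is the kind of alternative the post goes on to explore.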

Continue reading

Evolution of Colour Terms: 5 Cultural Constraints

Continuing my series on the Evolution of Colour terms, this post reviews studies of cultural constraints on colour naming. For the full dissertation and for references, go here.

Continue reading

Language Evolution and Language Acquisition

The way children learn language sets the adaptive landscape on which languages evolve.  This is acknowledged by many, but there are few connections between models of language acquisition and models of language evolution (some exceptions include Yang (2002), Yu & Smith (2007) and Chater & Christiansen (2009)).

However, the chasm between the two fields may be getting smaller, as theories are increasingly framed as models that are both more interpretable to technically-minded language evolutionists and extendible to populations and generations.

Also, strangely, models of word learning have been getting simpler over time.  This may reflect a move away from attributing language acquisition to specific mechanisms and towards more general cognitive explanations.  I review some older models here, along with a recent publication by Fazly et al.
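A cross-situational word learner of this simple associative kind can be sketched in a few lines. This is an illustration of the general idea, not Fazly et al.'s actual model: count word-object co-occurrences across ambiguous scenes, and take the highest-scoring object as a word's referent.

```python
from collections import defaultdict

# counts[word][object] = number of scenes in which they co-occurred
counts = defaultdict(lambda: defaultdict(int))

def observe(words, objects):
    """One learning situation: an utterance heard alongside a scene."""
    for w in words:
        for o in objects:
            counts[w][o] += 1

# Invented, ambiguous learning situations.
observe(["ball", "red"], ["BALL", "DOG"])
observe(["ball"],        ["BALL", "CUP"])
observe(["dog", "red"],  ["DOG", "BALL"])
observe(["dog"],         ["DOG", "CUP"])

def best_referent(word):
    """The object most often co-present with the word."""
    return max(counts[word], key=counts[word].get)

print(best_referent("ball"))  # -> BALL (co-occurs in both 'ball' scenes)
print(best_referent("dog"))   # -> DOG
```

No dedicated word-learning mechanism is involved, only co-occurrence counting, which is the sense in which these models have become simpler and more general-cognitive.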

Continue reading

Can linguistic features reveal time depths as deep as 50,000 years ago?

Throughout much of our history language was transitory, existing only briefly within its speech community. The invention of writing systems heralded a way of recording some of its recent history, but for the most part linguists lack the stone tools archaeologists use to explore the early history of ancient technological industries. The question of how far back we can trace the history of languages is therefore an immensely important, and highly difficult, one to answer. However, it’s not impossible. Like biologists, who use highly conserved genes to probe the deepest branches on the tree of life, some linguists argue that highly stable linguistic features hold the promise of tracing ancestral relations between the world’s languages.

Previous attempts using cognates to infer the relatedness between languages are generally limited to predictions within the last 6,000–10,000 years. In the present study, Greenhill et al. (2010) decided to examine more stable linguistic features than the lexicon, arguing:

Continue reading