There is a huge amount of linguistic diversity in the world. Isolation and drift due to cultural evolution can explain much of this, but there are many cases where interacting groups use several languages. In fact, by some estimates, bilingualism is the norm for most societies. If one views language as a tool for communicating about objects and events, it seems strange, for two reasons, that linguistic diversity should be maintained over time. First, it seems more efficient, during language use, to have a one-to-one mapping between signals and meanings. Indeed, mutual exclusivity is exhibited by young children and has been argued to be an innate bias and crucial to the evolution of a linguistic system. How or why do bilinguals override this bias? Second, learning two language systems must be more difficult than learning one. What is the motivation for expending extra effort on learning an apparently redundant system?
There is a huge amount of linguistic diversity in the world. Isolation and drift due to cultural evolution can explain much of this, but there are many cases where linguistic diversity emerges and persists within groups of interacting individuals. Previous research has identified the use of linguistic cues of identity as an important factor in the development of linguistic diversity (e.g. Nettle, 1999). Gareth Roberts looks at this issue with an experimental paradigm.
This experiment was a game in which individuals had to trade commodities over a series of rounds. In each round, individuals were paired up either with a team-mate or a competitor, though the partner's true identity was hidden. Players were given random resources, but scored points based on how 'balanced' their resources were after trading (that is, you were punished for having much more meat than corn, for example). A commodity given to another individual was worth twice as much to the receiver as to the donor.
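The incentive structure can be sketched in a few lines. This is only an illustration: the post doesn't give Roberts' exact scoring formula, so the imbalance penalty below (spread between the largest and smallest holdings) is an assumption; only the doubling rule for gifts comes from the text.

```python
def balance_score(resources):
    """Higher is better: total holdings minus an imbalance penalty.
    The penalty here (max minus min holding) is a hypothetical stand-in
    for the game's actual 'balance' scoring rule."""
    amounts = list(resources.values())
    return sum(amounts) - (max(amounts) - min(amounts))

def trade(donor, receiver, commodity, amount):
    """A gift is worth twice as much to the receiver as to the donor."""
    donor[commodity] -= amount
    receiver[commodity] += 2 * amount
```

Under any scoring rule of this shape, trading surplus goods is positive-sum, which is what makes identifying a genuine team-mate worth the effort.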
Players could only interact through an 'alien' language via an instant-messaging system. Prior to the game, individuals learned an artificial language which they were to use in these interactions. All participants were initially given the same starting language. There were several conditions that manipulated the frequency with which you interacted with your team-mate and whether the task was competitive or co-operative. In the co-operative condition, four players were considered part of the same team and the task was to get as high a score as possible. In the competitive condition the four players were split into two groups and the task was to score more than the other team. In this condition, then, the main task was to identify whether your partner was a co-operator or a competitor.
The results showed that, if players interacted frequently enough with their team-mates and were in competition with another group, then linguistic diversity emerged. Over the course of the game each team developed its own ‘variety’, and this was used as a marker of group identity. For example, in one game two forms of the word for ‘you’ arose. Players in one team tended to use ‘lale’ while players in the other team tended to use ‘lele’, meaning that players could tell group membership from this variation. Thus, linguistic variation arose due to the linguistic system evolving to encode the identity of the speakers.
The diversity seemed to arise both from drift and from intentional change, both of which have been documented in the sociolinguistic literature. Roberts suggests that linguistic markers make good social markers because they are costly to obtain (and so difficult for free-riders to fake), salient, and flexible enough to cope with changing group dynamics. In the next post, I'll be thinking about a similar experiment looking at how linguistic variation might arise in a co-operative scenario.
Roberts, G. (2010). An experimental study of social selection and frequency of interaction in linguistic diversity. Interaction Studies, 11(1), 138-159. DOI: 10.1075/is.11.1.06rob
What do people expect when they sign up to a linguistics experiment?
I’m currently running an experiment and so I posted an ad for participants. It simply states “You will take part in a linguistics experiment. You will be paid £6.”, and gives my email. I got 30 replies in a few hours, but was struck by the variation in the responses. Here are some pointless graphs:
First, a look at the distribution of email subjects. This reveals that most people know they are going to participate in an experiment, but fewer realise that they will contribute to research. One person thought that they would be doing a “Research Study”. What’s one of them?
However, here’s the killer. Analysing the first lines of the emails, I noticed a distinct power law relationship between frequency and casualness.
People just don’t respect linguists any more.
And that’s why I make people do hours of mind-bending iterated learning experiments with spinning cats.
A recent study published in the proceedings of the Empirical Methods in Natural Language Processing Conference (EMNLP) in October and presented at the LSA conference last week found evidence of geographical lexical variation in Twitter posts. (For news stories on it, see here and here.) Eisenstein, O'Connor, Smith and Xing took a batch of Twitter posts from a released corpus of 15% of all posts during a week in March. In total, they kept 4.7 million tokens from 380,000 messages by 9,500 users, all geotagged from within the continental US. They cut out messages from over-active users, taking only messages from users with fewer than a thousand followers and followees (though the average author still published around 40 posts per day, which might be seen by some as excessive). They also took only messages from iPhones and BlackBerries, which have the geotagging function. Eventually, they ended up with just over 5,000 words, of which a quarter did not appear in the spell-checking lexicon aspell.
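The filtering criteria above can be summarised as a single predicate. This is a sketch under stated assumptions: the record fields (`followers`, `device`, `geotag`, and so on) are hypothetical names for illustration, not the paper's actual data format.

```python
GEOTAGGING_DEVICES = {"iPhone", "BlackBerry"}  # clients with geotagging

def keep_message(msg):
    """Apply the corpus-filtering criteria described in the text.
    Field names are assumptions, not the authors' schema."""
    return (
        msg["followers"] < 1000          # exclude over-active users
        and msg["followees"] < 1000
        and msg["device"] in GEOTAGGING_DEVICES
        and msg["geotag"] is not None    # must carry a location
        and msg["geotag"]["country"] == "US"
    )
```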
In order to figure out lexical variation accurately, both topic and geographical region had to be ascertained. To do this, they used a generative model (seen above) that jointly inferred both. Generative models work on the assumption that text is the output of a stochastic process that can be analysed statistically. By looking at large amounts of text, they were able to infer the topics being talked about. Basically, I could be thinking of a few topics – dinner, food, eating out. If I am in SF, it is likely that I may end up using the word taco in my tweet, based on those topics. What the model does is take those topics and work out from them which words are chosen, while at the same time factoring in the spatial region of the author. This way, lexical variation is easier to place accurately, whereas before discourse topic would have significantly skewed the results (the median error drops from 650 to 500 km, which isn't that bad, all in all).
The way it works (in summary, and quoting the slide show presented at the LSA annual meeting, since I'm not entirely sure of the details) is as follows. For each author, the model a) picks a region from P(r | ϑ), b) picks a location from P(y | λ, ν) and c) picks a distribution over topics from P(θ | α). For each token, it then a) picks a topic from P(z | θ), and b) picks a word from P(w | ν). Or something like that (sorry). For more, feel free to download the paper from Eisenstein's website.
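The generative story above can be made concrete with a toy simulation. To be clear, this is not the authors' model: the regions, topics, and vocabularies below are invented for illustration, and the fitted distributions are replaced with uniform random choices; only the sampling order (region, then per-token topic, then word conditioned on topic and region) follows the description.

```python
import random

# Hypothetical regions and topic vocabularies, invented for illustration.
REGIONS = ["Northeast", "South", "West"]
TOPIC_WORDS = {
    # each topic has a base vocabulary plus region-specific variants
    "food": {"base": ["dinner", "eat"], "West": ["taco"]},
    "greeting": {"base": ["hey"], "Northeast": ["uu"], "West": ["hella"]},
}

def generate_message(n_tokens=5):
    """Sample one author's message following the generative story."""
    region = random.choice(REGIONS)  # stand-in for P(r | ...)
    # author-specific topic distribution, stand-in for P(theta | alpha)
    weights = {t: random.random() for t in TOPIC_WORDS}
    total = sum(weights.values())
    topic_dist = {t: w / total for t, w in weights.items()}
    tokens = []
    for _ in range(n_tokens):
        # pick a topic for this token, stand-in for P(z | theta)
        topic = random.choices(list(topic_dist), weights=topic_dist.values())[0]
        # pick a word conditioned on topic AND region
        words = TOPIC_WORDS[topic]["base"] + TOPIC_WORDS[topic].get(region, [])
        tokens.append(random.choice(words))
    return region, tokens
```

Inference in the real model runs this story in reverse: given the observed words and geotags, it recovers the hidden regions and topics.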
Well, what did they find? Basically, Twitter posts do show massive variation based on region. There are geographically specific proper names, of course, and topics of local prominence, like taco in LA and cab in NY. There's also variation in foreign-language words, with pues in LA but papi in SF. More interestingly, however, there is a major difference in regional slang. 'uu', for instance, is found pretty much exclusively on the Eastern seaboard, while 'you' is stretched across the nation (with 'yu' being only slightly smaller). 'suttin' for 'something' is used only in NY, as is 'deadass' (meaning 'very') and, on an even smaller scale, 'odee', while 'af' is used for 'very' in the Southwest, and 'hella' is used in most of the Western states.
More importantly, though, the study shows that, using this model, we can separate geographical from topical variation, and discover geographical variation from the text itself rather than relying solely on geotagging. The authors hope that future work will cover differences between spoken variation and variation in digital media. And I, for one, think that's #deadass cool.
Eisenstein, J., O'Connor, B., Smith, N. A., & Xing, E. P. (2010). A latent variable model for geographic lexical variation. Proceedings of EMNLP.
Two days ago I ran a pilot experiment online from Replicated Typo. Thanks to all who took part. It’s a bit cheeky to exploit our readers, but it’s all in the name of science. Unfortunately, the pilot was a complete failure. Suggestions and comments are welcome.
The experiment investigated the role of variation in language learning. Here's what I was up to (plus the source code for running similar experiments):