Dialects in Tweets

A recent study published in the proceedings of the Empirical Methods in Natural Language Processing Conference (EMNLP) in October and presented at the LSA conference last week found evidence of geographical lexical variation in Twitter posts. (For news stories on it, see here and here.) Eisenstein, O’Connor, Smith and Xing took a batch of Twitter posts from a released corpus covering 15% of all posts during a week in March. In total, they kept 4.7 million tokens from 380,000 messages by 9,500 users, all geotagged from within the continental US. They cut out messages from over-active users, taking only messages from users with fewer than a thousand followers and followees. (However, the average author published around 40 posts per day, which might be seen by some as excessive.) They also only took messages from iPhones and BlackBerries, which have the geotagging function. Eventually, they ended up with a vocabulary of just over 5,000 words, of which a quarter did not appear in the spell-checking lexicon aspell.

The Generative Model

In order to measure lexical variation accurately, both topics and geographical regions had to be ascertained. To do this, they used a generative model (seen above) that accounts for both jointly. Generative models work on the assumption that text is the output of a stochastic process that can be analysed statistically. By looking at large amounts of text, they were able to infer the topics being talked about. Basically, I could be thinking of a few topics – dinner, food, eating out. If I am in SF, it is likely that I may end up using the word taco in my tweet, based on those topics. What the model does is take those topics and work out from them which words are chosen, while at the same time factoring in the spatial region of the author. This way, lexical variation is easier to place accurately, whereas before, discourse topic would have significantly skewed the results (the median error in predicting an author's location drops from 650 to 500 km, which isn’t that bad, all in all).

The way it works (in summary and quoting the slide show presented at the LSA annual meeting, since I'm not entirely sure on the details) is that the model generates each author's text in a series of steps. For each author, the model a) picks a region from P( r | ∂ ), b) picks a location from P( y | lambda, v ) and c) picks a distribution over topics from P( Theta | alpha ). For each token, it then a) picks a topic from P( z | Theta ) and b) picks a word from P( w | nu ). Or something like that (sorry). For more, feel free to download the paper on Eisenstein's website.
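As a rough illustration of the steps above, here is a minimal Python sketch of that kind of generative process. It is not the authors' implementation: the vocabulary, regions, probabilities and the function name `generate_author` are all made up for the example, the location is assumed to come from a simple Gaussian around a region centre, and the word distribution is assumed to depend on both the region and the topic, which is how I read the description.

```python
import random

# Toy data, invented purely for illustration (not taken from the paper).
VOCAB = ["deadass", "cab", "hella", "taco"]
REGIONS = ["NY", "West"]
REGION_PROBS = [0.5, 0.5]                          # prior over regions
REGION_CENTERS = [(40.7, -74.0), (37.8, -122.4)]   # rough lat/lon per region
REGION_SPREAD = [1.0, 3.0]                         # std. dev. in degrees (assumed)
ALPHA = [1.0, 1.0]                                 # Dirichlet prior over 2 topics
# WORD_DISTS[region][topic] is a probability distribution over VOCAB.
# Topic 0 is "emphasis", topic 1 is "out and about" (both hypothetical).
WORD_DISTS = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],  # NY
    [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]],  # West
]


def sample_categorical(probs, rng):
    """Draw an index from a discrete distribution given as a list of weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_author(n_tokens, rng):
    """Generate one toy author: region, location, topic mix and tokens."""
    # Per-author steps: region, then location, then topic proportions.
    region = sample_categorical(REGION_PROBS, rng)
    lat0, lon0 = REGION_CENTERS[region]
    spread = REGION_SPREAD[region]
    location = (rng.gauss(lat0, spread), rng.gauss(lon0, spread))
    theta = sample_dirichlet(ALPHA, rng)

    # Per-token steps: a topic, then a word given that topic and the region.
    tokens = []
    for _ in range(n_tokens):
        topic = sample_categorical(theta, rng)
        word = sample_categorical(WORD_DISTS[region][topic], rng)
        tokens.append(VOCAB[word])

    return {"region": region, "location": location,
            "theta": theta, "tokens": tokens}


if __name__ == "__main__":
    author = generate_author(5, random.Random(0))
    print(REGIONS[author["region"]], author["tokens"])
```

The actual study goes the other way, of course: it starts from the observed words and locations and infers the hidden regions and topics. The sketch only shows the story the model assumes about how tweets come about.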

Well, what did they find? Basically, Twitter posts do show massive variation based on region. There are geographically-specific proper names, of course, and topics of local prominence, like taco in LA and cab in NY. There's also variation in foreign language words, with pues in LA but papi in SF. More interestingly, however, there are major differences in regional slang. 'uu', for instance, is found pretty much exclusively on the Eastern seaboard, while 'you' is spread across the nation (with 'yu' covering only a slightly smaller area). 'suttin' for something is used only in NY, as is 'deadass' (meaning very) and, on an even smaller scale, 'odee', while 'af' is used for very in the Southwest, and 'hella' is used in most of the Western states.

Dialectal variation for 'very'

More importantly, though, the study shows that, using this model, we can separate geographical and topical variation, as well as discover geographical variation from text instead of relying solely on geotagging. The authors hope that future work will cover differences between spoken variation and variation in digital media. And I, for one, think that’s #deadass cool.

Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, & Eric P. Xing (2010). A Latent Variable Model for Geographic Lexical Variation. Proceedings of EMNLP

Genetic Components and Cultural Differences: The social sensitivity hypothesis

Cultural differences are often attributed to events far removed from genetics. This belief often rests on the assertion that if you take an individual, at birth, from one society and implant them in another, then they will generally grow up to become well-adjusted to their adopted culture. Whilst this is more than likely true, even if certain cultural features may disagree with someone of a different ethnic background (e.g. degrees of alcohol tolerance), the situation is not as clear cut as certain political factions may have you believe. Largely due to studies on gene-culture coevolution, we are now starting to understand the complex dynamics through which genes and culture interact.

First, a particular culture may exert selection pressures on genes that make adopting a particular cultural trait advantageous. This is evident in the strong selection of the lactose-tolerance allele due to the spread of dairy farming. Second, pre-existing gene distributions provide pressures through which culture adapts. Off the top of my head, one proposed example of this is a paper by Dediu and Ladd (2007), which looked at how the distribution of the derived haplotypes of ASPM and Microcephalin may have subtly influenced the development of tonal languages. The paper in question, however, is looking more broadly at culture. Specifically, the authors, Baldwin Way and Matthew Lieberman, examine recent genetic association studies and how variation within genes involved in central neurotransmitter systems is associated with differences in social sensitivity. In particular, they highlight a correlation between the relative frequencies of certain gene variants and the relative degree of individualism or collectivism within certain populations.


Some Links #4

Back to the future on syntax and Broca’s area. Talking Brains provides a concise and humorous post about why Broca’s area is not the seat of syntax, be it domain-specific or domain-general. I tend to think that areas important for syntactic processing are probably distributed throughout the left perisylvian region, which would explain why Broca’s aphasics are quite capable of making grammatical judgements. Then again, another reason why damage to Broca’s area doesn’t, to quote Hickok, “obliterate the ability to make such judgements” may be that the processing shifts to another region (a sort of ancillary system). This is very possible given neuroplasticity.

Neuroplasticity is a dirty word. Having mentioned neuroplasticity, I now feel obligated to mention this brilliant post over at Mind Hacks. It provides a sort of 101 approach to neuroplasticity, which, after all, simply means something in the brain has changed. Still, as one poster from Ethnographer.com pointed out: “However, at its most abstract, the concept of neuroplasticity is often arrayed against that other commonplace abstract notion, that the brain is genetically ‘hard-wired’ in some way”.

Dialect Geography and Social Networks. Mark Liberman over at Language Log discusses geographical patterns of linguistic variation and recent analyses of Facebook networks in the US. Put succinctly: they don’t line up very well. He also asks some interesting questions about the role Facebook might play as a proxy for communication patterns.

How best to learn R. R is an invaluable statistical package. If, like me, you find yourself being dropped in at the deep end, then things can seem slightly confusing in an environment that is far less user-friendly than, say, SPSS. All the important stuff is in the comments section of the post, but you should take some time out to have a general poke around Statistical Modelling, Causal Inference and Social Science.

Are Scottish People Living Dangerously? The short answer: Yes. Barking Up The Wrong Tree links to a study claiming that “Almost the entire adult population of Scotland (97.5%) are likely to be either cigarette smokers, heavy drinkers, physically inactive, overweight or have a poor diet.”

The Sun Gone Crazy? Apparently, for the past two years there’s been a prolonged absence of sunspots. But as Adam Frank mentions, “The magnetic activity of stars like sun, which is the root cause of the sunspot cycle, is still poorly understood even after decades of intense study.  It’s more than an academic concern”.

Three Questions for Michael Tomasello. A cool little interview with the chimpanzee, linguistics and cooperation guru, Michael Tomasello, over at Cognition and Culture.

Lady Liberty's Awful Health

Readers from either Britain or the US will know about the relatively recent furore over comparisons between private and NHS-style healthcare. I was hoping to post an old article I wrote about the topic, but sadly it’s disappeared from my hard drive. Instead, here is a very good video from the New Scientist website that takes a scientific, rather than a political approach to the problem:

Hat tip to Evolving Thoughts.

Olfactory communication and mate choice

From the regulation of reproduction in bacterial colonies (Bassler, 2002) to the complex smell and taste systems of humans (Van Toller & Dodd, 1988), the ability to sense chemical stimuli, known as chemosensation, is believed to be the most basic and ubiquitous of senses (Bhutta, 2007). One strain of thought places chemosensation as merely an evolved ability to detect dangerous and volatile substances – such as putrefied food (see Bhutta, 2007). Still, the notion that this ability to detect chemical stimuli, particularly in the domain of smell, serves a purpose in communication is not necessarily a contemporary concept (Wyatt, 2009).
