media – Replicated Typo

A recent study, published in the proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) in October and presented at the LSA conference last week, found evidence of geographical lexical variation in Twitter posts. (For news stories on it, see here and here.) Eisenstein, O’Connor, Smith and Xing took a batch of Twitter posts from a released corpus covering 15% of all posts during a week in March. In total, they kept 4.7 million tokens from 380,000 messages by 9,500 users, all geotagged from within the continental US. They cut out messages from over-active users, taking only messages from users with fewer than a thousand followers and followees. (However, the average author published around 40 posts per day, which might be seen by some as excessive.) They also only took messages from iPhones and BlackBerries, which have the geotagging function. Eventually, they ended up with a vocabulary of just over 5,000 words, of which a quarter did not appear in the spell-checking lexicon aspell.

In order to figure out lexical variation accurately, both topic and geographical region had to be ascertained. To do this, they used a generative model (seen above) that accounts for both jointly. Generative models work on the assumption that text is the output of a stochastic process that can be analysed statistically. By looking at large amounts of text, they were able to infer the topics that are being talked about. Basically, I could be thinking of a few topics: dinner, food, eating out. If I am in SF, it is likely that I may end up using the word taco in my tweet, based on those topics. What the model does is take those topics and work out from them which words are chosen, while at the same time factoring in the spatial region of the author. This way, lexical variation is easier to place accurately, whereas before, discourse topic would have significantly skewed the results (the median error drops from 650 to 500 km, which isn’t that bad, all in all).

The way it works (in summary, and quoting the slide show presented at the LSA annual meeting, since I’m not entirely sure of the details) is that the model generates each author and each token in a few steps. For each author, the model a) picks a region from P( r | ∂ ), b) picks a location from P( y | lambda, v ), and c) picks a distribution over topics from P( Theta | alpha ). For each token, it must a) pick a topic from P( z | Theta ), and then b) pick a word from P( w | nu ). Or something like that (sorry). For more, feel free to download the paper on Eisenstein’s website.
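To make those steps a bit more concrete, here is a toy sketch of that generative story in Python. It is not the authors' code or their actual parameterisation: the two regions, their coordinates, the topics, the vocabularies and every probability below are invented for illustration, and the names (`REGION_PRIOR`, `WORDS`, `generate_author`) are hypothetical. I have also assumed that the word distribution depends on both the region and the topic, and I use independent Gaussians for latitude and longitude as a simplification.

```python
import random

# Everything below is made up for illustration; none of it comes from the paper.
REGION_PRIOR = {"LA": 0.5, "NY": 0.5}            # stands in for P(r | ...)
REGIONS = {
    "LA": {"center": (34.05, -118.24), "spread": 0.5},
    "NY": {"center": (40.71, -74.01), "spread": 0.5},
}
TOPICS = ["food", "travel"]
ALPHA = [1.0, 1.0]                                # Dirichlet prior over topics

# Assumed: word probabilities depend on both region and topic.
WORDS = {
    ("LA", "food"): {"taco": 0.6, "dinner": 0.4},
    ("LA", "travel"): {"freeway": 0.7, "cab": 0.3},
    ("NY", "food"): {"pizza": 0.6, "dinner": 0.4},
    ("NY", "travel"): {"cab": 0.7, "subway": 0.3},
}


def sample_categorical(rng, dist):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample_dirichlet(rng, alpha):
    """Draw from a Dirichlet by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_author(rng, n_tokens):
    # Author level: a) region, b) location, c) topic proportions.
    region = sample_categorical(rng, REGION_PRIOR)
    lat0, lon0 = REGIONS[region]["center"]
    spread = REGIONS[region]["spread"]
    location = (rng.gauss(lat0, spread), rng.gauss(lon0, spread))
    theta = sample_dirichlet(rng, ALPHA)

    # Token level: a) topic from theta, b) word given region and topic.
    tokens = []
    for _ in range(n_tokens):
        topic = rng.choices(TOPICS, weights=theta, k=1)[0]
        tokens.append(sample_categorical(rng, WORDS[(region, topic)]))
    return region, location, theta, tokens


if __name__ == "__main__":
    rng = random.Random(0)
    print(generate_author(rng, 8))
```

The real model runs this story in reverse: given the words and the geotags, it infers the regions, topics and word distributions most likely to have produced them. The sketch only shows the forward direction, which is the part the slides spell out.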

Well, what did they find? Basically, Twitter posts do show massive variation based on region. There are geographically-specific proper names, of course, and topics of local prominence, like taco in LA and cab in NY. There’s also variation in foreign language words, with pues in LA but papi in SF. More interestingly, however, there is a major difference in regional slang. ‘uu’, for instance, is found pretty much exclusively on the Eastern seaboard, while ‘you’ is spread across the nation (with ‘yu’ covering only a slightly smaller area). ‘suttin’ for something is used only in NY, as is ‘deadass’ (meaning very) and, on an even smaller scale, ‘odee’, while ‘af’ is used for very in the Southwest, and ‘hella’ is used in most of the Western states.

Figure: Dialectical variation for 'very'

More importantly, though, the study shows that, using this model, we can separate geographical and topical variation, as well as discover geographical variation from text instead of relying solely on geotagging. Future work from the authors will hopefully cover differences between spoken variation and variation in digital media. And I, for one, think that’s #deadass cool.

Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, & Eric P. Xing (2010). A Latent Variable Model for Geographic Lexical Variation. Proceedings of EMNLP

As a linguist I struggle with genetics; as an evolution geek, however, I am very interested in it. This creates all sorts of problems and high levels of anxiety when talking about FOXP2 and other genes, for fear that I misunderstand the highly complex interactions between genes, environmental effects and cascading effects, which cannot be summed up in a simple “x gene causes x trait in humans” paradigm.

I would like to point everyone towards a new blog post Dorothy Bishop has written over at the Guardian science blogs:

Where does the myth of a gene for things like intelligence come from?

It is about busting the widespread belief (among idiots like me) that individual genes determine traits such as intelligence, optimism, obesity and dyslexia. I find it interesting that this is presented in the blogs section and not as a mainstream article.

She points out on Twitter this morning that the Jedward pic was not her idea. (I add this point because I found it weirdly comforting)

And it’s also lovely to see that at the bottom of the pile of comments is a well articulated reply by Dorothy to individual users.

I love blogging, because there exists the ability for individuals to reply to claims made about them, primary sources (papers &c.) are cited and checkable and there’s none of the unnecessary dumbing down found in mainstream media. Here’s an article by Ben Goldacre expanding on this subject (which incidentally includes work by Dorothy Bishop).

Here is a parable about how, as a blogger, my claims were checked, discussed and ultimately concluded to be bollocks. (I don’t have a contrastive parable about what would have happened if I’d instead made these claims in the mainstream media but many stories of this nature can be found here.)

If you read the blog post I wrote about links between Autism and SLI you would have seen me make this claim:

the CNTNAP2 gene has been found in independent samples to be associated with both ASD and SLI. This is interesting because it could show that gene mutations which cause improved social abilities could have also caused changes in our linguistic ability on a syntactic or phonological level.

This blog post cited the work of Dorothy Bishop quite heavily and she took the time out to come and tell me problems with it. Here’s what she said:

As you anticipated, I think there are some problems with the implications you draw from the work. There are two issues. The first is that the variants of CNTNAP2 associated with language level are not mutations. You would usually only use that term in the case where most people had the same DNA sequence in a gene, but rare individuals had a different DNA sequence. FOXP2 is a case in point: there is a family, the KE family, who have a mutation affecting around half the family members, where the DNA sequence is changed. For most people in the general population, and for most people with SLI, the FOXP2 sequence is the same.

The CNTNAP2 gene is very different. The DNA sequence has different versions in different people, and one version, which is pretty common in the general population, is associated with a small decrease in language abilities, but most people with this version would not be recognised as having any language impairment. Most researchers now think that SLI is probably the result of the combined effect of many genes, each of which may nudge language ability up or down a bit. In this regard, language ability is rather like height: there are rare mutations that may make a person drastically tall or short, but most variation in height arises from the combined effect of many small influences of genes that show DNA variation in the normal population.

The second issue concerns the evidence for CNTNAP2 being involved in both SLI and autism. Many people in the field do think this means that the same gene that can cause SLI can also cause autism, and that the only difference is that people with autism have additional difficulties going beyond language – what I have termed the ‘autism as SLI plus’ model. I supported that model in the past, but there are some facts that are hard to square with it. First, although many people with autism have structural language problems (affecting grammar and phonology) similar to those in SLI, not all of them do. So people with high-functioning autism or Asperger syndrome may have well-developed skills in syntax and phonology, while still having difficulties with pragmatics. The second point, which is a big problem for a simple genetic account, is that whereas the relatives of people with SLI often have some difficulties with structural language, we don’t usually see that in relatives of people with autism, even if the person with autism has poor language skills. It was this latter point that I was particularly keen to try and explain in my paper. The bottom line is that to explain the pattern of data we need to think in terms of interactions between genes (technically known as epistasis). So there are genetic variants that increase risk of autism, and others that increase risk of SLI. Most of these will have an individually small effect. However, if you have a risk variant for a gene influencing SLI (such as CNTNAP2) in the context of having a genetic risk for autism, the effect on language will be much worse. According to this model CNTNAP2 doesn’t affect both social cognition and language; rather it affects language, but that effect will get multiplied if the person also has risk factors for autism.

Which is SOOOO interesting.

I’d really like to thank her for replying; it’s really lovely to know that high-flying academics are willing to help out when a sincere blogger tries to understand something and falls on their arse.
