Dialects in Tweets

January 12, 2011 in Linguistics, Research Blogging, Reviews

A recent study, published in October in the proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) and presented at the LSA conference last week, found evidence of geographical lexical variation in Twitter posts. (For news stories on it, see here and here.) Eisenstein, O’Connor, Smith and Xing took a batch of Twitter posts from a released corpus covering 15% of all posts during a week in March. In total, they kept 4.7 million tokens from 380,000 messages by 9,500 users, all geotagged from within the continental US. They cut out messages from over-active users, taking only messages from users with fewer than a thousand followers and followees. (However, the average author published around 40 posts each in the sample, that is 380,000 messages from 9,500 users, which might be seen by some as excessive.) They also only took messages from iPhones and BlackBerries, which have the geotagging function. Eventually, they ended up with a vocabulary of just over 5,000 words, of which a quarter did not appear in the spell-checking lexicon aspell.

The Generative Model

In order to figure out lexical variation accurately, both topics and geographical regions had to be ascertained. To do this, they used a generative model that accounts for both jointly. Generative models work on the assumption that text is the output of a stochastic process that can be analysed statistically. By looking at large amounts of text, they were able to infer the topics being talked about. Basically, I could be thinking of a few topics – dinner, food, eating out. If I am in SF, it is likely that I may end up using the word taco in my tweet, based on those topics. What the model does is take those topics and work out from them which words are chosen, while at the same time taking the spatial region of the author into account. This way, lexical variation is easier to place accurately, whereas before, discourse topic would have significantly skewed the results (the median error in predicting an author’s location drops from 650 to 500 km, which isn’t that bad, all in all).

The way it works (in summary and quoting the slide show presented at the LSA annual meeting, since I’m not entirely sure on the details) is that the model generates each author’s text in several steps. For each author, the model a) picks a region from P( r | ∂ ), b) picks a location from P( y | lambda, v ) and c) picks a distribution over topics from P( Theta | alpha ). For each token, it a) picks a topic from P( z | Theta ), and then b) picks a word from P( w | nu ). Or something like that (sorry). For more, feel free to download the paper on Eisenstein’s website.
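The generative steps listed above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the authors’ implementation: the two regions, their coordinates, the topics, the word probabilities and the function names are all hypothetical placeholders, and letting the word distribution depend on both region and topic is my reading of the description above rather than something taken from the paper.

```python
import random

# Everything below is a made-up toy setup for illustration only.
REGION_PRIOR = {"west": 0.5, "east": 0.5}
REGIONS = {
    "west": {"center": (34.0, -118.2), "spread": 2.0},
    "east": {"center": (40.7, -74.0), "spread": 2.0},
}
TOPICS = ["food", "transport"]
# Assumption: each region has its own version of each topic's word distribution.
WORDS = {
    ("west", "food"): {"taco": 0.6, "dinner": 0.4},
    ("west", "transport"): {"car": 0.7, "cab": 0.3},
    ("east", "food"): {"pizza": 0.6, "dinner": 0.4},
    ("east", "transport"): {"cab": 0.7, "car": 0.3},
}


def draw(dist, rng):
    """Sample one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate_author(n_tokens, rng, alpha=1.0):
    """Generate one author's region, location and tokens."""
    # Author step a): pick a region.
    region = draw(REGION_PRIOR, rng)
    # Author step b): pick a location scattered around the region's centre.
    lat0, lon0 = REGIONS[region]["center"]
    spread = REGIONS[region]["spread"]
    location = (rng.gauss(lat0, spread), rng.gauss(lon0, spread))
    # Author step c): pick topic proportions (normalised gamma draws
    # give a sample from a symmetric Dirichlet distribution).
    raw = [rng.gammavariate(alpha, 1.0) for _ in TOPICS]
    theta = {t: g / sum(raw) for t, g in zip(TOPICS, raw)}
    tokens = []
    for _ in range(n_tokens):
        topic = draw(theta, rng)                  # token step a): pick a topic
        word = draw(WORDS[(region, topic)], rng)  # token step b): pick a word
        tokens.append(word)
    return region, location, tokens


if __name__ == "__main__":
    print(generate_author(8, random.Random(1)))
```

The study runs this story in reverse: given the words and the geotags, it infers which regions and topics most plausibly produced them.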

This post was chosen as an Editor's Selection for ResearchBlogging.org.

Well, what did they find? Basically, Twitter posts do show massive variation based on region. There are geographically-specific proper names, of course, and topics of local prominence, like taco in LA and cab in NY. There’s also variation in foreign language words, with pues in LA but papi in SF. More interestingly, however, there is a major difference in regional slang. ‘uu’, for instance, is found pretty much exclusively on the Eastern seaboard, while ‘you’ is stretched across the nation (with ‘yu’ being only slightly less widespread). ‘suttin’ for something is used only in NY, as is ‘deadass’ (meaning very) and, on an even smaller scale, ‘odee’, while ‘af’ is used for very in the Southwest, and ‘hella’ is used in most of the Western states.

Dialectal variation for 'very'

More importantly, though, the study shows that, using this model, we can separate geographical and topical variation, as well as discover geographical variation from the text itself instead of relying solely on geotagging. The authors hope that future work will cover differences between spoken variation and variation in digital media. And I, for one, think that’s #deadass cool.

Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, & Eric P. Xing (2010). A Latent Variable Model for Geographic Lexical Variation. Proceedings of EMNLP

Python: Your one stop shop for social science research.

January 12, 2011 in 101, Modelling

Python is easy to use and pretty solid across platforms. Drew Conway has written an essential list of Python tools for social science researchers. From running experiments to analysis and modelling, Python can do practically anything you’d ever want, mostly in two or three lines. Hooray!

http://www.drewconway.com/zia/?p=204

Kre-8-iv Spell!ng

January 10, 2011 in Uncategorized

A new blog about spelling conventions and change:

http://a2dez.com/

Recursion: what is it, who has it, and how did it evolve?

January 5, 2011 in Evolution, Linguistics

Hello Hello and Happy New Year,

So a new article appeared on the internet late last year by Coolidge, Overmann and Wynn (2010) (hereafter referred to as COW because it makes me smile). It’s a really short sweet little paper and you should read it as recursion is perhaps one of the hottest topics around language evolution. This generally stems from Hauser, Chomsky and Fitch’s (HCF, 2002) claim that it is the only feature of language unique to humans. I thought it would be useful to outline some of the issues surrounding it as put forward by the COW paper due to its high-profile, controversial and important position within current issues in language evolution.

History

Recursion was first talked about within the field of linguistics by Bar-Hillel in 1953. This was before Chomsky included the concept in his Generative Grammar in 1956.

It wasn’t until 2002 that HCF made the claim that recursion was the only feature of language which was included in the faculty of language in the narrow sense (FLN) and was therefore unique to humans.

Definition

The article outlines two definitions of recursion (within linguistics):

(1) embeddedness of phrases within other phrases, which entails keeping track of long-distance dependencies among phrases

(2) the specification of the computed output string itself, including meta-recursion, where recursion is both the recipe for an utterance and the overarching process that creates and executes the recipe

I always worry when there is more than one definition for a thing because this often results in people talking past each other or getting confused within their own arguments. These definitions are also important to pin down before one starts making claims about whether recursion is present in species other than humans, or about what people mean when they refer to the evolution of recursion.
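To make definition (1) a bit more concrete, here is a minimal Python sketch of embedding phrases within phrases. The example sentence and the function name are my own illustration and do not come from the COW paper. The function calls itself to place each inner clause between a noun and the verb that goes with it, which is exactly what creates the long-distance dependencies a listener has to keep track of.

```python
def embed(nouns, verbs):
    """Build a centre-embedded clause as a list of words.

    The i-th noun is the subject of the i-th verb, so the first noun
    and its verb end up at opposite ends of the sentence.
    """
    if len(nouns) != len(verbs):
        raise ValueError("each noun needs exactly one verb")
    if not nouns:
        return []  # base case: nothing left to embed
    # Recursive case: the inner clause sits between this noun and its verb.
    return ["the", nouns[0]] + embed(nouns[1:], verbs[1:]) + [verbs[0]]


if __name__ == "__main__":
    words = embed(["rat", "cat", "dog"], ["died", "chased", "bit"])
    print(" ".join(words))  # the rat the cat the dog bit chased died
```

In the output, "rat" and "died" belong together even though two other clauses sit between them; each extra level of embedding stretches that dependency further.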

Evolutionary Scenarios

The paper also outlines two evolutionary scenarios for the adaptive value of recursion in human language.

(1) The gradualist position posits precursors, such as animal communication and protolanguages, and holds that the selective purpose of recursion was for communication.

(2) The saltationist position assumes no gradual development of recursion and posits that it evolved for reasons other than communication.

The latter of these is the standpoint taken by the HCF paper. If one discounts communication, reasons for recursion evolving could include an increase in working memory for other purposes, or spatial navigation.

Pinker and Jackendoff (2005) argue that since recursion only exists in language to express recursive thoughts it must have pre-existed language.

COW (2010) points out that this is all very well, but the question remains: what are recursive thoughts, and why are they adaptive? These recursive acts may exist for the purposes of diplomatic speech, perlocutionary acts or prospective memory and cognition (these are discussed at greater length in COW). These explanations assume that the adaptive force was a social one, something that was not considered before Pinker and Jackendoff because recursion is often understood in the realm of mathematics, away from the social context of speech acts.

Unique to Humans?

An often cited example debunking recursion’s importance to human language is the Pirahã tribe, who apparently do not have it (Everett 2005). The data from Everett is anecdotal, from one source and sketchy. Even if one were to accept the claimed lack of recursion, it can be attributed to other factors such as cultural constraints or (although I think this is going a bit far, but then Bickerton always does go a bit too far) an underlying neurophysiological deficiency in the Pirahã, such as a limited working memory capacity or an extreme case of acquisitional delay.

COW then covers several animal studies which claim that recursion is present in animals including starlings and various monkeys. These rest on the claim that the ability to acquire a phrase structure grammar means the presence of recursive ability (which is bollocks). These studies also fall short when one considers that starlings’ songs are used to communicate emotional states, not recursive thoughts.

References

Bar-Hillel, Y. (1953). On recursive definitions in empirical science. Proceedings of the 11th International Congress of Philosophy, Brussels, 1953, 5:160-165.

Coolidge, F., Overmann, K., & Wynn, T. (2010). Recursion: what is it, who has it, and how did it evolve? Wiley Interdisciplinary Reviews: Cognitive Science DOI: 10.1002/wcs.131

Hauser, M. D., Chomsky, N., & Fitch, W. T. (2002). The faculty of language: what is it, who has it, and how did it evolve? Science, 298:1569-1579

http://www.st-andrews.ac.uk/~wtsf/downloads/HCF2002.pdf

Bored birds, busy brains: Habituation to song initiates significant molecular changes in auditory forebrain of zebra finch

January 3, 2011 in Uncategorized

When we think of habituation, we tend to think of a process in which there is a decrease in psychological and behavioural response(s) over time following an organism’s exposure to a stimulus. Conceptualising habituation in this manner seems to imply the loss of something once an initial learning event has taken place. Although this may accurately describe what occurs at the psychological and behavioural levels, a study by a group of scientists from the University of Illinois (Dong et al. 2010), which examines habituation at the neurobiological level, shows that, contrary to this conceptualisation, both initial exposure and habituation to song playbacks initiate a vast array of genetic activity in the zebra finch brain.

The systematic regulation of FoxP2 expression in singing zebra finches has been the subject of previous posts, but there is also a growing literature, of which Dong et al.’s study is a part, documenting increases in expression of the ZENK gene (which encodes a transcription factor protein that in turn regulates the expression of other target genes) in zebra finch auditory forebrain areas in response to playbacks of song or the song of a conspecific. Earlier studies showed that ZENK expression seems to mirror the typical decline in response associated with habituation: after a certain amount of repetition, presentation of the song that originally elicited upregulation of ZENK no longer did so, and ZENK returned to baseline levels. Upregulation of ZENK would occur again, however, if a different song or an aspect of novelty was introduced (e.g. the original song was presented in a different visual or spatial context).

What Dong et al. have demonstrated, however, by conducting a large scale analysis of gene expression at the initial exposure, habituation, and post-habituation stages, is that unexpectedly profound genetic changes occur as a result of habituation, in the absence of any additional novel stimuli, following the surge of activity observed during initial exposure to novel song. Thus, the resounding merits of the Dong et al. (2010) study lie in the breadth of their approach, providing a true sense of magnitude with respect to genomic involvement in vocal communication and illuminating important influences that have gone unnoticed by studies with a narrower focus. I summarise the experimental design and findings of the paper below.

Read the rest of this entry →

Top-down vs bottom-up approaches to cognition: Griffiths vs McClelland

December 23, 2010 in Linguistics, Research Blogging, Science, Uncategorized

There is a battle about to commence. A battle in the world of cognitive modelling. Or at least a bit of a skirmish. Two articles to be published in Trends in Cognitive Sciences debate the merits of approaching cognition from different ends of the microscope.

Structured Probabilistic Thingamy

On the side of probabilistic modelling we have Tom Griffiths, Nick Chater, Charles Kemp, Amy Perfors and Joshua Tenenbaum. Representing (perhaps non-symbolically) emergentist approaches are James McClelland, Matthew Botvinick, David Noelle, David Plaut, Timothy Rogers, Mark Seidenberg and Linda B. Smith. This contest is not short of heavyweights.

However, the first battleground seems to be who can come up with the most complicated diagram.  I leave this decision to the reader (see first two images).

The central issue is which approach is the most productive for explaining phenomena in cognition. David Marr’s levels of explanation include the ‘computational’ characterisation of the problem, an ‘algorithmic’ description of how the problem is solved and an ‘implementational’ explanation which focusses on how the task is actually implemented by real brains. The structured probabilistic approach works ‘top-down’ from the computational level, while emergentism works ‘bottom-up’ from the implementational level.

Read the rest of this entry →

“Xanadu” Revisited (Culturomics?)

December 17, 2010 in Evolution

Google has just released an interesting dataset. Geoff Nunberg describes it at Language Log:

Culled from the Google Books collection, it contains more than 5 million books published between 1800 and 2000 — at a rough estimate, 4 percent of all the books ever published — of which two-thirds are in English and the others distributed among French, German, Spanish, Chinese, Russian, and Hebrew. (The English corpus alone contains some 360 billion words, dwarfing better structured data collections like the corpora of historical and contemporary American English at BYU, which top out at a paltry 400 million words each.)

It is, he says, “the largest corpus ever assembled for humanities and social science research.” The New York Times has reported on it and there’s an article in Science based on it.

You can also play around with it online with the Google Books Ngram Viewer. You enter individual words or phrases (up to five words long) and Google graphs their frequency over time. I’ve spent a little time playing around with it.

In particular, I’m interested in the proper noun, “Xanadu.” As you may know, it’s the name of Kubla Khan’s summer capital and is also the second word in Coleridge’s most famous poem, “Kubla Khan.” Several years ago I did a Google search on “Xanadu” and was surprised to come up with over two million hits. How’d that happen? I wondered.
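For readers who want a feel for what “frequency over time” means, here is a minimal Python sketch of the general idea. It is not Google’s implementation and does not use the Google Books data: the two-sentence toy corpus, the years and the function names are hypothetical, and the sketch simply assumes that frequency means the share of all n-grams of that length in a given year that match the query.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency_by_year(corpus, phrase):
    """Map each year to the share of its n-grams that equal `phrase`.

    `corpus` maps a year to a list of texts from that year. Matching is
    case-insensitive and tokenisation is plain whitespace splitting.
    """
    n = len(phrase.split())
    target = phrase.lower()
    result = {}
    for year, texts in sorted(corpus.items()):
        counts = Counter()
        for text in texts:
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        result[year] = counts[target] / total if total else 0.0
    return result


if __name__ == "__main__":
    # Hypothetical toy corpus, one short text per year.
    toy = {
        1816: ["In Xanadu did Kubla Khan"],
        1980: ["Xanadu the film and Xanadu the song"],
    }
    print(relative_frequency_by_year(toy, "Xanadu"))
```

Plotting the resulting values against the years gives the kind of curve the viewer draws, only on a vastly smaller scale.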

I ended up writing a long post on The Valve, which generated an interesting discussion, and then distilling that down into a tech report. You can download the report here (One Candle, a Thousand Points of Light: The Xanadu Meme). Here’s the abstract:

I treat a single word ‘xanadu’, as a ‘meme’ and follow it from a 17th century book, to a 19th century poem (Coleridge’s “Kubla Khan”), into the 20th century where it was picked up by a classic movie (“Citizen Kane”), an ongoing software development project (Ted Nelson’s Project Xanadu), and another movie and hit song, Olivia Newton-John’s Xanadu. The aggregate result can be seen when you google the word, you get 6 million hits. What is interesting about those hits is that, while some of them are directly related to Coleridge’s poem, more seem to be related to Nelson’s software project, Olivia Newton-John’s film and song, and (indirectly) to Welles’ movie. Thus one cluster of Xanadu sites is high tech while another is about luxury and excess (and then there’s the Manchester Swingers Club Xanadu).

Read the rest of this entry →

Fun Language Experiment: Results

December 9, 2010 in Uncategorized

Two days ago I ran a pilot experiment online from Replicated Typo.  Thanks to all who took part. It’s a bit cheeky to exploit our readers, but it’s all in the name of science.  Unfortunately, the pilot was a complete failure. Suggestions and comments are welcome.

The experiment was into the role of variation in language learning.  Here’s what I was up to (plus the source code for running similar experiments):

Read the rest of this entry →

Fungus, -i. 2nd Decl. N. Masculine – or is it?: On Gender

December 8, 2010 in Evolution, Genetics, Linguistics, Research Blogging, Science

In an attempt to write out my thoughts for others instead of continually building them up in saved stickies, folders full of .pdfs, and hastily scribbled lecture notes, as if waiting for the spontaneous incarnation of what looks increasingly like a dissertation, I’m going to give a glimpse today of what I’ve been looking into recently. (Full disclosure: I am not a biologist, and was told specifically by my high school teacher that it would be best if I didn’t do another science class. Also, I liked Latin too much, which explains the title.)

It all started, really, with trying to get my flatmate Jamie into research blogging. His intended career path is mycology, where there are apparently fewer posts available for graduate study than in Old English syntax. As he was setting up the since-neglected Fungi Imperfecti, he pointed this article out to me: A Fungus Walks Into A Singles Bar. The post explains briefly how fungi have a very complicated sexual reproduction system.

Fungi are eukaryotes, the same as all other complex organisms with complicated cell structures. However, they are in their own kingdom, for a variety of reasons. Fungi are not the same as mushrooms, which are only the fruiting bodies of certain fungi. Their reproductive mechanisms are rather unexpectedly complex, in that the normal conventions of sex do not apply. Not all fungi reproduce sexually, and many are isogamous, meaning that their gametes look the same and differ only in certain alleles in certain areas called mating-type regions. Some fungi only have two mating types, which would give the illusion of being like animal genders. However, others, like Schizophyllum commune, have over ten thousand, although these interact in an odd way, such that they’re only productive if the mating regions are highly compatible (Uyenoyama 2005).

Some fungi are homothallic, meaning that self-mating and reproduction are possible. This means that a spore has within it two dissimilar nuclei, ready to mate – the button mushroom apparently does this (yes, the kind you buy in a supermarket). Heterothallic fungi, on the other hand, merely need to find another fungus that isn’t the same mating type – which is pretty easy, if there are hundreds of options. Other types of fungi can’t reproduce together, but can vegetatively blend together to share resources, interestingly enough. Think of mind-melding, like Spock. Alternatively, think of mycelia fusing together to share resources.

In short, the system is ridiculously confusing, and not at all like the simple bipolar genders of, say, humans (if we take the canonical view of human gender, meaning only two.) I’m still trying to find adequate research on the origins of this sort of system. Understandably, it’s difficult. Mycologists agree:

“The molecular genetical studies of the past ten years have revealed a genetic fluidity in fungi that could never have been imagined. Transposons and other mobile elements can switch the mating types of fungi and cause chromosomal rearrangements. Deletions of mitochondrial genes can accumulate as either symptomless plasmids or as disruptive elements leading to cellular senescence…[in summary,] many aspects of the genetic fluidity of fungi remain to be resolved, and probably many more remain to be discovered.” (Deacon, 1997: pg. 157)

At this point you’re probably asking why I’ve posted this here. Well, perhaps understandably, I started drawing comparisons between mycologic mating types and linguistic agreement immediately. First, mating type isn’t limited to bipolarity – neither is grammatical gender. Nearly 10% of the 257 languages noted for number of genders on WALS have more than five genders. Ngan’gityemerri seems to be winning, with 15 different genders (Reid, 1997). Gender distinctions generally have to do with a semantic core – one which need not be based on sex, either, but can cover categories like animacy. Gender can normally be diagnosed by agreement marking, which, leaving aside genetic analysis of the parent, could be analogous to diagnosing mating type from fungal offspring. Gender can be a fluid system, susceptible to decay, mostly through attrition, but also to reformation and realignment – the same is true of mating types. (For more, see Corbett, 1991.)

As with all biologic to linguistic analogues, the connections are a bit tenuous. I’ve been researching fungal replication partly for the sake of dispelling the old “that’s too complex to have evolved” argument, which is probably the most fun point to argue against creationists with. However, I’ve mostly been doing this because fungi and linguistic gender distinctions are just so damn interesting.

If anyone likes, I’ll keep you updated on mycologic evolution and the linguistic analogues I can tentatively draw. For now, though, I’ve really got to get back to studying for my examination in two days. Which means I’ve got to stop thinking about a future post detailing why “Prokaryotic evolution and the tree of life are two different things” (Bapteste et al., 2009)…

References:

  • Corbett, G. Gender. Cambridge University Press, Cambridge: 1991.
  • Deacon, JW. Modern Mycology. Blackwell Science, Oxford: 1997.
  • Reid, N. and Harvey, M. (eds.). Nominal Classification in Aboriginal Australia. John Benjamins, Philadelphia, PA: 1997.

  • Uyenoyama, M. (2005). Evolution under tight linkage to mating type. New Phytologist, 165 (1), 63-70 DOI: 10.1111/j.1469-8137.2004.01246.x
  • Bapteste E, O’Malley MA, Beiko RG, Ereshefsky M, Gogarten JP, Franklin-Hall L, Lapointe FJ, Dupré J, Dagan T, Boucher Y, & Martin W (2009). Prokaryotic evolution and the tree of life are two different things. Biology Direct, 4 PMID: 19788731

Fun Language Experiment!

December 7, 2010 in Uncategorized

I’m running a fun language evolution experiment online! You can take part – it only takes 4 minutes. Once I get enough participants, I’ll post the results right here on ReplicatedTypo.

Take part here!

(You’ll need sound enabled)

UPDATE

The experiment is complete; see the results here.