Tag Archives: linguistics

Corpus Linguistics, Literary Studies, and Description

One of my main hobbyhorses these days is description. Literary studies has to get a lot more sophisticated about description, which is mostly taken for granted and so is not done very rigorously. There isn’t even a sense that there’s something there to be rigorous about. Perhaps corpus linguistics is a way to open up that conversation.
The crucial insight is this: What makes a statement descriptive IS NOT how one arrives at it, but the role it plays in the larger intellectual enterprise.

A Little Background Music

Back in the 1950s there was this notion that the process of aesthetic criticism took the form of a pipeline that started with description, moved on to analysis, then interpretation and finally evaluation. Academic literary practice simply dropped evaluation altogether and concentrated its efforts on interpretation. There were attempts to side-step the difficulties of interpretation by asserting that one is simply describing what’s there. To this Stanley Fish has replied (“What Makes an Interpretation Acceptable?” in Is There a Text in This Class?, Harvard 1980, p. 353):

 

The basic gesture then, is to disavow interpretation in favor of simply presenting the text: but it actually is a gesture in which one set of interpretive principles is replaced by another that happens to claim for itself the virtue of not being an interpretation at all.

 

And that takes care of that.
Except that it doesn’t. Fish is correct in asserting that there’s no such thing as a theory-free description. Literary texts are rich and complicated objects. When the critic picks this or that feature for discussion, those choices are made with something in mind. They aren’t innocent.
But, as Michael Bérubé has pointed out in “There is Nothing Inside the Text, or, Why No One’s Heard of Wolfgang Iser” (in Gary Olson and Lynn Worsham, eds. Postmodern Sophistries, SUNY Press 2004, pp. 11-26) there is interpretation and there is interpretation and they’re not alike. The process by which the mind’s eye makes out letters and punctuation marks from ink smudges is interpretive, for example, but it’s rather different from throwing Marx and Freud at a text and coming up with meaning.
Thus I take it that the existence of an interpretive component in any description need not imply that it is impossible to descriptively carve literary texts at their joints. And that’s one of the things that I want from description: to carve texts at their joints.
Of course, one has to know how to do that. And THAT, it would seem, is far from obvious.

Literary History, the Future: Kemp Malone, Corpus Linguistics, Digital Archaeology, and Cultural Evolution

In scientific prognostication we have a condition analogous to a fact of archery—the farther back you draw your longbow, the farther ahead you can shoot.
– Buckminster Fuller

The following remarks are rather speculative in nature, as many of my remarks tend to be. I’m sketching large conclusions on the basis of only a few anecdotes. But those conclusions aren’t really conclusions at all, not in the sense that they are based on arguments presented prior to them. I’ve been thinking about cultural evolution for years, and about the need to apply sophisticated statistical techniques to large bodies of text—really, all the texts we can get, in all languages—by way of investigating cultural evolution.

So it is no surprise that this post arrives at cultural evolution and concludes with remarks on how the human sciences will have to change their institutional ways to support that kind of research. Conceptually, I was there years ago. But now we have a younger generation of scholars who are going down this path, and it is by no means obvious that the profession is ready to support them. Sure, funding is there for “digital humanities” and so deans and department chairs can get funding and score points for successful hires. But you can’t build a profound and new intellectual enterprise on financially-driven institutional gamesmanship alone.

You need a vision, and though I’d like to be proved wrong, I don’t see that vision, certainly not on the web. That’s why I’m writing this post. Consider it a sequel to an article I published back in 1976 with my teacher and mentor, David Hays: Computational Linguistics and the Humanist. This post presupposes the conceptual framework of that vision, but neither restates nor endorses its specific recommendations (given in the form of a hypothetical program for simulating the “reading” of texts).

The world has changed since then and in ways neither Hays nor I anticipated. This post reflects those changes and takes as its starting point a recent web discussion about recovering the history of literary studies by using the largely statistical techniques of corpus linguistics in a kind of digital archaeology. But like Tristram Shandy, I approach that starting point indirectly, by way of a digression.

Who’s Kemp Malone?

Back in the ancient days when I was still an undergraduate, and we tied an onion in our belts as was the style at the time, I was at an English Department function at Johns Hopkins and someone pointed to an old man and said, in hushed tones, “that’s Kemp Malone.” Who is Kemp Malone? I thought. From his Wikipedia bio:

Born in an academic family, Kemp Malone graduated from Emory College as it then was in 1907, with the ambition of mastering all the languages that impinged upon the development of Middle English. He spent several years in Germany, Denmark and Iceland. When World War I broke out he served two years in the United States Army and was discharged with the rank of Captain.

Malone served as President of the Modern Language Association, and other philological associations … and was etymology editor of the American College Dictionary, 1947.

Who’d have thought the Modern Language Association was a philological association?


Jim Hurford: What is wrong, and what is right, about current theories of language, in the light of evolution? (2)

This post continues my summary of Jim Hurford’s discussion of two contrasting extreme positions on language evolution in his plenary talk at the Poznan Linguistic Meeting. Here’s the summary of these two positions from my last post:

Position A:

(1) There was a single biological mutation which (2) created a new unique cognitive domain, which then (3) immediately enabled the “unlimited command of complex structures via the computational operation of merge”. (4) This domain is used primarily for advanced private thought and only derivatively for public communication. (5) It was not promoted by natural selection.

Position B:

(1) There were many cumulative mutations which (2) allowed the expanding interactions of pre-existing cognitive domains, creating a new domain which, however, is not characterized by principles unique to language. This then (3) gradually enabled the command of successively more complex structures. Also, on this view, (4) this capacity was used primarily for public communication, and only derivatively for advanced private thought, and (5) was promoted by natural selection.

Hurford criticized the position that the biological changes enabling language primarily evolved for private thought, because this would imply that the first species in the Homo lineage to develop the capacity for unlimited combinatorial private thought (i.e. “merge”) consisted of non-social, isolated but clever hominids. This, as Hurford rightly points out, is quite unrealistic given everything we know about human evolution regarding, for example, competition, group size, neocortex size and tactical deception. There is in fact very strong evidence that what characterizes humans the most is the exact opposite of what would be predicted by the “merge developed in the service of enhancing private thought” position: we have the largest group size of any primate, the largest neocortex (which has been linked to the affordances of navigating a complex social world) and the most pronounced capacity for tactical deception.



Jim Hurford: What is wrong, and what is right, about current theories of language, in the light of evolution?

As I mentioned in my previous post, the 2012 Poznań Linguistic Meeting (PLM) features a thematic section on “Theory and evidence in language evolution research.” This section’s invited speaker was Jim Hurford, who is Emeritus Professor at Edinburgh University. Hurford is a very eminent figure in language evolution research and has published two very influential and substantive volumes on “Language in the Light of Evolution”: The Origins of Meaning (2007) and The Origins of Grammar (2011).

In his talk, Hurford asked “What is wrong, and what is right, about current theories of language, in the light of evolution?” (you can find the abstract here).

Hurford presented two extreme positions on the evolution of language (which nevertheless are advocated by quite a number of evolutionary linguists) and then discussed what kinds of evidence and lines of reasoning support or seem to go against these positions.

Extreme position A, which basically is the Chomskyan position of Generative Grammar, holds that:

(1) There was a single biological mutation which (2) created a new unique cognitive domain, which then (3) immediately enabled the unlimited command of complex structures via the computational operation of merge. Further, according to this extreme position, (4) this domain is used primarily for advanced private thought and only derivatively for public communication and lastly (5) it was not promoted by natural selection.

On the other end of the spectrum there is extreme position B, which holds that:

(1) There were many cumulative mutations which (2) allowed the expanding interactions of pre-existing cognitive domains, creating a new domain which, however, is not characterized by principles unique to language. This then (3) gradually enabled the command of successively more complex structures. Also, on this view, (4) this capacity was used primarily for public communication, and only derivatively for advanced private thought, and (5) was promoted by natural selection.

Hurford then went on to discuss which of these individual points were more likely to capture what actually happened in the evolution of language.

He first looked at the debate over the role of natural selection in the evolution of language. In Generative Grammar there is a biological, neurological mechanism or computational apparatus, called Universal Grammar (UG) by Chomsky, which determines which languages human infants can possibly acquire. In earlier Generative paradigms, like the Government & Binding approach of the 1980s, UG was thought to be extremely complex. What is more, some of these factors and structures seemed extremely arbitrary. Thus, from this perspective, it seemed inconceivable that they could have been selected for by natural selection. This is illustrated quite nicely in a famous quote by David Lightfoot:

“Subjacency has many virtues, but I am not sure that it could have increased the chances of having fruitful sex” (Lightfoot 1991: 69).



PLM2012 Coverage: Dirk Geeraerts: Corpus Evidence for Non-Modularity

The first plenary talk at this year’s Poznań Linguistic Meeting was by Dirk Geeraerts, who is professor of linguistics at the University of Leuven, Belgium.

In his talk, he discussed the possibility that corpus studies could yield evidence against the supposed modularity of language and mind endorsed by, for example, Generative linguists (you can find the abstract here).

Geeraerts began his talk by stating that there seems to be a paradigm shift in linguistics from an analysis of structure that is based on introspection to analyses of behaviour based on quantitative linguistic studies. More and more researchers are adopting quantified corpus-based analyses, which test hypotheses using statistical testing of language behaviour. As a data-set they use experimental data or large corpora.

Multifactoriality

One further trend Geeraerts identified in this paradigm shift is that these kinds of analyses are becoming more and more multifactorial, in that they include multiple different factors which are both internal and external to language. Importantly, this way of doing linguistics is fundamentally different from the mainstream late 20th century view of linguistics.

What is important to note here when comparing this trend to other approaches to studying language is that multifactoriality goes against Chomsky’s idea of grammar as an ideal mental system that can be studied through introspection. In the traditional view, it is supposed that there is some kind of ideal language system to which everyone has access. This line of reasoning then justifies introspection as a method of studying the whole system of language and making valid generalizations about it. However, this goes against the emerging corpus-linguistic view of language. On this view, a random speaker is not representative of the linguistic community as a whole. The linguistic system is not homogeneous across all speakers, and therefore introspection doesn’t suffice.
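To make the contrast concrete, here is a minimal sketch of the kind of quantitative test such corpus work relies on: a two-proportion z-test asking whether two speaker groups really differ in which of two competing constructions they prefer. The construction, the groups and all the counts below are invented for illustration, not from any study Geeraerts cited.

```python
import math

# Hypothetical corpus counts (invented for illustration): how often two
# speaker groups choose the double-object construction over the to-dative.
group_a = {"double_object": 620, "to_dative": 380}
group_b = {"double_object": 480, "to_dative": 520}

def proportion(counts):
    """Share of double-object uses within one group."""
    return counts["double_object"] / sum(counts.values())

def two_proportion_z(c1, c2):
    """Two-proportion z-test: do the groups differ in construction choice?"""
    n1, n2 = sum(c1.values()), sum(c2.values())
    p1, p2 = proportion(c1), proportion(c2)
    pooled = (c1["double_object"] + c2["double_object"]) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z = two_proportion_z(group_a, group_b)
print(f"p(A) = {proportion(group_a):.2f}, p(B) = {proportion(group_b):.2f}, z = {z:.2f}")
# |z| > 1.96 indicates a group difference at the 5% level -- exactly the kind
# of heterogeneity that a single speaker's introspection cannot reveal.
```

The point of the sketch is the methodological one made above: if the groups differ significantly, no single speaker’s intuitions can stand in for the community’s linguistic system.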

Modularity

The main thrust of Geeraerts’ talk was that research within this emerging paradigm might also call into question the assumption of the modularity of the mind (as advocated, for example, by Jerry Fodor or Neil Smith): the view of the mind as a compartmentalized system consisting of discrete components or modules (for example, the visual system or language) plus a central processor.



The power of diversity: New Scientist recognises the growing work on social structure and linguistic structure

A feature article in last week’s New Scientist asks why there is so much linguistic diversity present in the world, and what are the forces that drive it.  The article reads like a who’s who of the growing field of language structure and social structure:  Mark Pagel, Gary Lupyan, Quentin Atkinson, Robert Munroe, Carol and Melvin Ember, Dan Dediu and Robert Ladd, Stephen Levinson (click on the names to see some Replicated Typo articles about their work).  This is practically as close as my subject will come to having a pull-out section in Vanity Fair.  Furthermore, it recognises the weakening grip of Chomskyan linguistics.

Commentators have already gotten hung up on whether English became simplified before or after spreading, but this misses the impact of the article:  There is an alternative approach to linguistics which looks at the differences between languages and recognises social factors as the primary source of linguistic change.  Furthermore, these ideas are testable using statistics and genetic methods.  It’s a pity the article didn’t mention the possibility of experimental approaches, including Gareth Roberts’ work on emerging linguistic diversity and work on cultural transmission using the Pictionary paradigm (Simon Garrod, Nick Fay, Bruno Gallantucci, see here and here).

David Robson (2011). Power of Babel: Why one language isn’t enough. New Scientist, 2842. Online.


Advances in Visual Methods for Linguistics (AVML2012)

Some peeps over at the University of York are organising a conference on advances in visual methods for linguistics (AVML), to take place in September next year. This might be of interest to evolutionary linguists who use things like phylogenetic trees, networks, visual simulations or other fancy-dancy visual methods. The following is taken from their website:

Linguistics, like other scientific disciplines, is centrally reliant upon visual images for the elicitation, analysis and presentation of data. It is difficult to imagine how linguistics could have developed, and how it could be done today, without visual representations such as syntactic trees, psychoperceptual models, vocal tract diagrams, dialect maps, or spectrograms. Complex multidimensional data can be condensed into forms that can be easily and immediately grasped in a way that would be considerably more taxing, even impossible, through textual means. Transforming our numerical results into graphical formats, according to Cleveland (1993: 1), ‘provides a front line of attack, revealing intricate structure in data that cannot be absorbed in any other way. We discover unimagined effects, and we challenge imagined ones.’ Or, as Keith Johnson succinctly puts it, ‘Nothing beats a picture’ (2008: 6).

So embedded are the ways we visualize linguistic data and linguistic phenomena in our research and teaching that it is easy to overlook the design and function of these graphical techniques. Yet the availability of powerful freeware and shareware packages which can produce easily customized publication-quality images means that we can create visual enhancements to our research output more quickly and more cheaply than ever before. Crucially, it is very much easier now than at any time in the past to experiment with imaginative and innovative ideas in visual methods. The potential for the inclusion of enriched content (animations, films, colour illustrations, interactive figures, etc.) in the ever-increasing quantities of research literature, resource materials and new textbooks being published, especially online, is enormous. There is clearly a growing appetite among the academic community for the sharing of inventive graphical methods, to judge from the contributions made by researchers to the websites and blogs that have proliferated in recent years (e.g. Infosthetics, Information is Beautiful, Cool Infographics, BBC Dimensions, or Visual Complexity).

In spite of the ubiquity and indispensability of graphical methods in linguistics it does not appear that a conference dedicated to sharing techniques and best practices in this domain has taken place before. This is less surprising when one considers that relatively little has been published specifically on the subject (exceptions are Stewart (1976), and publications by the LInfoVis group). We think it is important that researchers from a broad spectrum of linguistic disciplines spend time discussing how their work can be done more efficiently, and how it can achieve greater impact, using the profusion of flexible and intuitive graphical tools at their disposal. It is also instructive to view advances in visual methods for linguistics from a historical perspective, to gain a greater sense of how linguistics has benefited from borrowed methodologies, and how in some cases the discipline has been at the forefront of developments in visual techniques.

The abstract submission deadline is the 9th January.


Great Andamanese: The key to more than one linguistic puzzle?

Last week we had a lecture from Anvita Abbi on rare linguistic structures in Great Andamanese – a language spoken in the Andaman Islands.  The indigenous populations of the Andaman Islands lived in isolation for tens of thousands of years until the 19th Century, but still exhibit some common features of south-east Asian languages such as retroflex consonants.  This could be evidence for the migration route of humans from India to Australia.  Indeed, recent genetic research has shown that the Andamanese are descendants of the first human migration from Africa in the Palaeolithic, though Abbi suggested that the linguistic evidence is also a strong marker of human migration and an “important repository of our shared human history and civilization”.

Although the similarities are fascinating for studies of cultural evolution, the rarity of some structures in Great Andamanese is even more intriguing.

The Andaman Islands



Tea Leaves and Lingua Francas: Why the future is not easy to predict

We all take comfort in our ability to project into the future. Be it through arbitrary patterns in Spring Pouchong tea leaves, or making statistical inferences about the likelihood that it will rain tomorrow, our accumulation of knowledge about the future is based on continued attempts at attaining certainty: that is, we wish to know what tomorrow will bring. Yet the difference between benignly staring at tea leaves and using computer models to predict tomorrow’s weather is fairly apparent: the former relies on a completely spurious relationship between tea leaves and events in the future, whereas the latter utilises our knowledge of weather patterns and applies it to extrapolate from currently available data into the future. Put simply: if there are dense grey clouds in the sky, then it is likely we’ll get rain. Conversely, if tea leaves arrange themselves into the shape of a middle finger, it doesn’t mean you are going to be continually dicked over for the rest of your life. Although, as I’ll attempt to make clear below, these are differences of degree, rather than absolutes.

So, how are we going to get from tea leaves to Lingua Francas? Well, the other evening I found myself watching Dr Nicholas Ostler give a talk on his new book, The Last Lingua Franca: English Until the Return of Babel. For those of you who aren’t familiar with Ostler, he’s a relatively well-known linguist, having written several successful books popularising socio-historical linguistics, and he first came to my attention through Razib Khan’s detailed review of Empires of the Word. Indeed, on the basis of Razib’s post, I was not surprised by the depth of knowledge expounded during the talk. On this note alone I’m probably going to buy the book, as the work certainly filters into my own interests in historical contact between languages and the subsequent consequences. However, as you can probably infer from the previous paragraph, there were some elements I was slightly less impressed with — and it is here where we get into the murky realms between tea leaves and knowledge-based inferences. But first, here is a quick summary of what I took away from the talk:


Cultural differences in lateral transmission: Phylogenetic trees are OK for Linguistics but not biology

The three areas under analysis

An article in PLoS ONE debunks the myth that hunter-gatherer societies borrow more words than agriculturalist societies. In doing so, it suggests that horizontal transmission is low enough for phylogenetic analyses to be a valid linguistic tool.

Lexicons from around 20% of the extant languages spoken by hunter-gatherer societies were coded for etymology (available in the supplementary material). The levels of borrowed words were compared with the languages of agriculturalist and urban societies taken from the World Loanword Database.  The study focussed on three locations:  Northern Australia, northwest Amazonia, and California and the Great Basin.

In opposition to some previous hypotheses, hunter-gatherer societies did not borrow significantly more words than agricultural societies in any of the regions studied.

The rates of borrowing were universally low, with most languages not borrowing more than 10% of their basic vocabulary.  The mean rate for hunter-gatherer societies was 6.38%, while the mean for agriculturalist societies was 5.15%.  This difference is actually significant overall, but not within particular regions.  Therefore, the authors claim, “individual area variation is more important than any general tendencies of HG or AG languages”.

Interestingly, in some regions, mobility, population size and population density were significant factors.  Mobile populations and low-density populations had significantly lower borrowing rates, while smaller populations borrowed proportionately more words.  This may be in line with the theory of linguistic carrying capacity as discussed by Wintz (see here and here).  The level of exogamy was a significant factor in Australia.
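To get a feel for how a group comparison like HG versus AG borrowing rates can be tested, here is a sketch of a simple permutation test on per-language rates. The rates below are invented for illustration; the study’s actual data are in its supplementary material, and the paper’s own statistical methods may well differ from this one.

```python
import random
import statistics

random.seed(0)  # reproducible sketch

# Invented per-language borrowing rates (% of basic vocabulary borrowed).
hg = [2.1, 4.5, 6.0, 7.3, 8.9, 9.5, 5.2, 7.8]  # hunter-gatherer languages
ag = [1.8, 3.2, 4.9, 5.5, 6.7, 8.8, 2.9, 4.1]  # agriculturalist languages

def perm_test(a, b, n_perm=10_000):
    """One-sided permutation test on the difference in group means:
    how often does random relabelling produce a difference at least
    as large as the observed one?"""
    observed = statistics.mean(a) - statistics.mean(b)
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        resampled = statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):])
        if resampled >= observed:
            hits += 1
    return observed, hits / n_perm

obs, p = perm_test(hg, ag)
print(f"observed difference: {obs:.2f} percentage points, p = {p:.3f}")
```

A permutation test makes no normality assumptions, which suits small, skewed samples of languages; a difference can look sizeable overall yet fail to reach significance within a single region, just as the study reports.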

The study concludes that phylogenetic analyses are a valid form of linguistic analysis because the level of horizontal transmission is low.  That is, languages are tree-like enough for phylogenetic assumptions to be valid:

“While it is important to identify the occasional aberrant cases of high borrowing, our results support the idea that lexical evolution is largely tree-like, and justify the continued application of quantitative phylogenetic methods to examine linguistic evolution at the level of the lexicon. As is the case with biological evolution, it will be important to test the fit of trees produced by these methods to the data used to reconstruct them. However, one advantage linguists have over biologists is that they can use the methods we have described to identify borrowed lexical items and remove them from the dataset. For this reason, it has been proposed that, in cases of short to medium time depth (e.g., hundreds to several thousand years), linguistic data are superior to genetic data for reconstructing human prehistory.”

Excellent – linguistics beats biology for a change!
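The loan-removal step the authors mention is simple in principle; here is a minimal sketch, where the field names and example items are invented for illustration and are not the paper’s actual coding scheme:

```python
# Sketch: drop lexical items coded as borrowings before feeding the
# lexicon to a phylogenetic method. All entries here are invented.
lexicon = [
    {"gloss": "water", "form": "ngapa",    "loan": False},
    {"gloss": "horse", "form": "yarraman", "loan": True},   # coded as a loan
    {"gloss": "fire",  "form": "warlu",    "loan": False},
]

native_only = [item for item in lexicon if not item["loan"]]
print([item["gloss"] for item in native_only])  # → ['water', 'fire']
```

Biologists have no comparable way to flag and excise horizontally transferred genes item by item, which is exactly the advantage the quoted passage claims for linguistics.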

However, while the level of horizontal transmission might not be a problem in this analysis, there may be a problem in the paths of borrowing.  If a language borrows relatively few words, but those words come from many different languages, and may have many paths through previous generations, there may be a subtle effect of horizontal transmission that is being masked.  The authors acknowledge that they did not address the direction of transmission in a quantitative way.

A while ago, I did a study of English etymology, trying to quantify the level of horizontal transmission through time (description here).  The graph for English doesn’t look tree-like to me; perhaps the dynamics of borrowing work differently for languages with a high level of contact:

Claire Bowern, Patience Epps, Russell Gray, Jane Hill, Keith Hunley, Patrick McConvell, Jason Zentz (2011). Does Lateral Transmission Obscure Inheritance in Hunter-Gatherer Languages? PLoS ONE, 6(9). doi:10.1371/journal.pone.0025195