Notes toward a theory of the corpus, Part 1: History

By corpus I mean a collection of texts. The texts can be of any kind, but I am interested in literature, so I’m interested in literary texts. What can we infer from a corpus of literary texts? In particular, what can we infer about history?

Well, to some extent, it depends on the corpus, no? I’m interested in an answer which is fairly general in some ways, in other ways not. The best thing to do is to pick an example and go from there.

The example I have in mind is the 3300 or so 19th century Anglophone novels that Matthew Jockers examined in Macroanalysis(2013 – so long ago, but it almost seems like yesterday). Of course, Jockers has already made plenty of inferences from that corpus. Let’s just accept them all more or less at face value. I’m after something different.

I’m thinking about the nature of historical process. Jockers’ final study, the one about influence, tells us something about that process, more than Jockers seems to realize. I think it tells us that cultural evolution is a force in human history, but I don’t intend to make that argument here. Rather, my purpose is to argue that Jockers has created evidence that can be brought to bear on that kind of assertion. The purpose of this post is to indicate why I believe that.

A direction in a 600 dimension space

In his final study Jockers produced the following figure (I’ve superimposed the arrow):

Each node in that graph represents a single novel. The image is a 2D projection of a roughly 600 dimensional space, one dimension for each of the 600 features Jockers has identified for each novel. The length of each edge is proportional to the distance between the two nodes. Jockers has eliminated all edges above a certain relatively small value (as I recall he doesn’t tell us the cut off point). Thus two nodes are connected only if they are relatively close to one another, where Jockers takes closeness to indicate that the author of the more recent novel was influenced by the author of more distant one.

Each node in that graph represents a single novel. The image is a 2D projection of a roughly 600 dimensional space, one dimension for each of the 600 features Jockers has identified for each novel. The length of each edge is proportional to the distance between the two nodes. Jockers has eliminated all edges above a certain relatively small value (as I recall he doesn’t tell us the cut off point). Thus two nodes are connected only if they are relatively close to one another, where Jockers takes closeness to indicate that the author of the more recent novel was influenced by the author of more distant one.

You may or may not find that to be a reasonable assumption, but let’s set it aside. What interests me is the fact that the novels in this are in rough temporal order, from 1800 at the left (gray) to 1900 at the right (purple). Where did that order come from? There were no dates in 600D description of each novel. As far as I can tell, that must be a product of the historical process that produced those texts. That process must therefore have a temporal direction.

I’ve spent a fair amount of effort explicitly arguing that point [1], but don’t want to reprise that argument here. For the purposes of this piece, assume that that argument is at least a reasonable one to make.

What is that direction? I don’t have a name for it, but that’s what the arrow in the image indicates. One might call it Progress, especially with Hegel looking over your shoulder. And I admit to a bias in favor of progress, though I have no use for the notion of some ultimate telostoward which history tends. But saying that direction is progress is a gesture without substantial intellectual content because it doesn’t engage with the terms in which that 600D space is constructed. What are those terms? Some of them are topics of the sort identified in topic analysis, e.g. American slavery, beauty and affection, dreams and thoughts, Greek and Egyptian gods, knaves rogues and asses, life history, machines and industry, misery and despair, scenes of natural beauty, and so on [3]. Others are stylistic features, such as the frequency of specific words, e.g. the, heart, would, me, lady, which are the first five words in a list Jockers has in the “Style” chapter of Macroanalysis(p. 94).

In a post back in 2014 I suggested that Jockers’ image depicts the Geistof 19th century Anglo-American literary culture [2]. That’s what interests me, the possibility that we’re looking at a 21st century operationalization of an idea from 19th century German idealism. Here’s what the Stanford Encyclopedia of Philosophy has to say about Hegel’s conception of history [4]:

In a sense Hegel’s phenomenology is a study of phenomena (although this is not a realm he would contrast with that of noumena) and Hegel’s Phenomenology of Spirit is likewise to be regarded as a type of propaedeutic to philosophy rather than an exercise in or work of philosophy. It is meant to function as an induction or education of the reader to the standpoint of purely conceptual thought from which philosophy can be done. As such, its structure has been compared to that of a Bildungsroman (educational novel), having an abstractly conceived protagonist—the bearer of an evolving series of so-called shapes of consciousness or the inhabitant of a series of successive phenomenal worlds—whose progress and set-backs the reader follows and learns from. Or at least this is how the work sets out: in the later sections the earlier series of shapes of consciousness becomes replaced with what seem more like configurations of human social life, and the work comes to look more like an account of interlinked forms of social existence and thought within which participants in such forms of social life conceive of themselves and the world. Hegel constructs a series of such shapes that maps onto the history of western European civilization from the Greeks to his own time.

Now, I am not proposing that Jockers’ has operationalized that conception, those “so-called shapes of consciousness”, in any way that could be used to buttress or refute Hegel’s philosophy of history – which, after all, posited a final end to history. But I am suggesting that can we reasonably interpret that image as depicting a (single) historical phenomenon, perhaps even something like an animating ‘force’, albeit one requiring a thoroughly material account. Whatever it is, it is as abstract as the Hegelian Geist.

How could that be? Continue reading “Notes toward a theory of the corpus, Part 1: History”

Corpus Linguistics, Literary Studies, and Description

One of my main hobbyhorses these days is description. Literary studies has to get a lot more sophisticated about description, which is mostly taken for granted and so is not done very rigorously. There isn’t even a sense that there’s something there to be rigorous about. Perhaps corpus linguistics is a way to open up that conversation.
The crucial insight is this: What makes a statement descriptive IS NOT how one arrives at it, but the role it plays in the larger intellectual enterprise.

A Little Background Music

Back in the 1950s there was this notion that the process of aesthetic criticism took the form of a pipeline that started with description, moved on to analysis, then interpretation and finally evaluation. Academic literary practice simply dropped evaluation altogether and concentrated its efforts on interpretation. There were attempts to side-step the difficulties of interpretation by asserting that one is simply describing what’s there. To this Stanley Fish has replied (“What Makes an Interpretation Acceptable?” in Is There a Text in This Class?, Harvard 1980, p. 353):

 

The basic gesture then, is to disavow interpretation in favor of simply presenting the text: but it actually is a gesture in which one set of interpretive principles is replaced by another that happens to claim for itself the virtue of not being an interpretation at all.

 

And that takes care of that.
Except that it doesn’t. Fish is correct in asserting that there’s no such thing as a theory-free description. Literary texts are rich and complicated objects. When the critic picks this or that feature for discussion those choices are done with something in mind. They aren’t innocent.
But, as Michael Bérubé has pointed out in “There is Nothing Inside the Text, or, Why No One’s Heard of Wolfgang Iser” (in Gary Olson and Lynn Worsham, eds. Postmodern Sophistries, SUNY Press 2004, pp. 11-26) there is interpretation and there is interpretation and they’re not alike. The process by which the mind’s eye makes out letters and punctuation marks from ink smudges is interpretive, for example, but it’s rather different from throwing Marx and Freud at a text and coming up with meaning.
Thus I take it that the existence of some kind of interpretive component to any description need not imply that the necessity of interpretation implies that it is impossible to descriptively carve literary texts at their joints. And that’s one of the things that I want from description, to carve texts at their joints.
Of course, one has to know how to do that. And THAT, it would seem, is far from obvious.

Literary History, the Future: Kemp Malone, Corpus Linguistics, Digital Archaeology, and Cultural Evolution

In scientific prognostication we have a condition analogous to a fact of archery—the farther back you draw your longbow, the farther ahead you can shoot.
– Buckminster Fuller

The following remarks are rather speculative in nature, as many of my remarks tend to be. I’m sketching large conclusions on the basis of only a few anecdotes. But those conclusions aren’t really conclusions at all, not in the sense that they are based on arguments presented prior to them. I’ve been thinking about cultural evolution for years, and about the need to apply sophisticated statistical techniques to large bodies of text—really, all the texts we can get, in all languages—by way of investigating cultural evolution.

So it is no surprise that this post arrives at cultural evolution and concludes with remarks on how the human sciences will have to change their institutional ways to support that kind of research. Conceptually, I was there years ago. But now we have a younger generation of scholars who are going down this path, and it is by no means obvious that the profession is ready to support them. Sure, funding is there for “digital humanities” and so deans and department chairs can get funding and score points for successful hires. But you can’t build a profound and a new intellectual enterprise on financially-driven institutional gamesmanship alone.

You need a vision, and though I’d like to be proved wrong, I don’t see that vision, certainly not on the web. That’s why I’m writing this post. Consider it sequel to an article I published back in 1976 with my teacher and mentor, David Hays: Computational Linguistics and the Humanist. This post presupposes the conceptual framework of that vision, but does not restate nor endorse its specific recommendations (given in the form of a hypothetical program for simulating the “reading” of texts).

The world has changed since then and in ways neither Hays nor I anticipated. This post reflects those changes and takes as its starting point a recent web discussion about recovering the history of literary studies by using the largely statistical techniques of corpus linguistics in a kind of digital archaeology. But like Tristram Shandy, I approach that starting point indirectly, by way of a digression.

Who’s Kemp Malone?

Back in the ancient days when I was still an undergraduate, and we tied an onion in our belts as was the style at the time, I was at an English Department function at Johns Hopkins and someone pointed to an old man and said, in hushed tones, “that’s Kemp Malone.” Who is Kemp Malone, I thought? From his Wikipedia bio:

Born in an academic family, Kemp Malone graduated from Emory College as it then was in 1907, with the ambition of mastering all the languages that impinged upon the development of Middle English. He spent several years in Germany, Denmark and Iceland. When World War I broke out he served two years in the United States Army and was discharged with the rank of Captain.

Malone served as President of the Modern Language Association, and other philological associations … and was etymology editor of the American College Dictionary, 1947.

Who’d have thought the Modern Language Association was a philological association? Continue reading “Literary History, the Future: Kemp Malone, Corpus Linguistics, Digital Archaeology, and Cultural Evolution”