Over at New Savanna I've been blogging my way through Matthew Jockers, Macroanalysis: Digital Methods & Literary History, University of Illinois Press, 2013. I figured this particular post would be of interest here. If you're not familiar with topic analysis, there are some links below that'll help you out.
Chapter 8 of Macroanalysis is about “Theme.” Jockers uses topic analysis to investigate the occurrence of 500 ‘themes’ in a corpus of 3,346 19th-century British, American, and Irish books. He opens with a bit of intellectual history, from the Russian Formalists to Google’s Ngrams; then he launches into topic analysis, which emerged at the turn of the millennium. He gives some simple examples, and then he gets serious. But I’m going to skip over all of that for now.
For one thing, I’ve been through the topic analysis drill several times in the past year or so and don’t want to go through it again. If you need an introduction or a review, check out Topic Models: Strange Objects, New Worlds, or, in this series, Reading Macroanalysis 5: An Interlude on Scale: Micro, Meso, and Macro. For another, Jockers has put a topic tool online, 500 Themes from a corpus of 19th-Century Fiction. Those are the topics he discusses in this chapter.
Once I was done reading the chapter I started playing with the tool. I’d pick a topic and then look at the graphics:
- a word cloud to display the most frequent words in the topic,
- a bar chart indicating usage of the topic by author gender (male, female, and undetermined),
- a line graph showing gender usage over time,
- a bar chart indicating usage of the topic by author nationality (American, British, Irish), and
- a line graph showing national usage over time.
At first I was just browsing, moving from one theme to the next. But then I hit one that grabbed my attention.
So I spent the next couple of hours looking at themes and thinking about them. I’m going to devote the rest of this post and the next one to showing what I found. Then I’ll do a third post where I review what Jockers found and recast the enterprise in terms of cultural evolution.
Note that in all of this I’m just playing around, but in a serious way. It is all preliminary and provisional. I haven’t reached any firm conclusions on the particular themes I look at. The only thing I’m sure about is that this, and similar techniques, are going to revolutionize the way we do literary history.
Before proceeding, however, two caveats are necessary. While Jockers’ corpus is substantial, it isn’t every British, American, and Irish novel written in the 19th Century. Perhaps more important, it is natural to read these theme charts as reflecting the interests of the 19th Century reading public. And in some sense that is so. But we have to be careful.
For some of these books were more widely read than others and a few of them, the canonical ones, are still being read. But the extent of a book’s readership is not reflected in the data. The fact that a book was published at all implies, of course, that someone thought there was an audience for it. But a publisher’s interest isn’t quite the same as a reader’s interest. We simply don’t know how accurately publisher interest tracks reader interest. With those reservations in mind, let’s take a look.
Of Dogs and Gold
In the course of browsing through Jockers’ themes menu I saw “DOGS.” Let’s look at that, I thought. Why dogs? you may ask. No deep reason, but some years ago, way back in graduate school in fact, I’d noticed that dogs figured as a significant motif in Wuthering Heights. Major transitions among humans were marked by violence between dogs and humans (e.g. Lockwood arrives and is greeted by a barking dog, Catherine gets bitten by Skulker; see this post). More recently, I’d read a handful of articles about the domestication of dogs during human evolutionary history. I was just curious.
Here’s the word cloud for the DOGS topic:
The following graph stunned me. It depicts the occurrence of the dog topic by author’s gender over the course of the century. The medium gray line depicts male authors, the black line females, and the light gray line, authors where the gender was undetermined.
What’s that spike at the right edge? As soon as I realized that it was for male authors I thought, “Jack London, Call of the Wild.” I also had some doubts as to whether that book was in the corpus, as I didn’t believe the book was 19th Century, though I wasn’t sure. But that doubt didn’t stop me from nosing around. By the time I’d confirmed for myself that it wasn’t 19th Century (it was published in 1903) and Jockers had gotten back to me that, no, it wasn’t in the corpus, I’d already had too much fun browsing through the charts and had moved on to other topics (which I’ll get to in the next post).
But just what would I have learned if Call of the Wild had been in the corpus and it ‘caused’ that spike? Not much, I warrant. For one thing, there were over 3000 books in the corpus; would one book have been enough to cause that spike? What other books were there? Even if I knew, that at most gets me a formal cause, if you will. But it doesn’t tell me why, all of a sudden, we have male authors writing about dogs–efficient cause, if you will. And that’s what we want to know, right?
Let’s return to Call of the Wild. It was enormously popular and eventually made it into the canon. Why?
Its canonical status allows us to say that, sure, it sold well because it was so very good. But I don’t find that at all satisfying. There are noncanonical books that sell a million–does anyone believe that The Da Vinci Code is going to become canonical?–and there are canonical books that took a while to find an audience, such as Moby-Dick.
No, I figure the book was an instant hit because people were interested in whatever it is that the book is about, and that’s dogs and the Yukon, the Yukon during the period of the Klondike Gold Rush. Gold was discovered there in August of 1896, sending 100,000 prospectors to the Klondike over the next three years (at which point gold fever moved to Alaska). Is there anything in Jockers’ data that can reasonably be taken as an indicator of the Klondike Gold Rush?
Well, look at this graph, which depicts yearly interest in the GOLD AND TREASURE topic:
The black line depicts American authors. Notice that the lines for British (medium gray) and Irish (light gray) authors don’t even burp at that point, much less spike. So, the nationality is right, and the time seems right as well.
Actually, the peak seems to be before gold was discovered and it’s dropping precipitously during the discovery. Not good. But the lines depict moving averages and when you get to the end of the period, there’s nothing to average in part of your window. So maybe I need to get a closer look at the corpus and the data. In any event, if that line isn’t a reflection of the Klondike Gold Rush, what is it? Like the DOGS spike, it’s not a moderate or small change, it’s big. Something’s going on.
Yes, something IS going on, but it might not have anything to do with the 19th century. We’ll get to that at the end of this post.
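The point about moving averages can be made concrete with a small sketch. I don't know how the online tool actually smooths its lines, so everything here is an assumption for illustration: the function name, the five-year centered window, and the yearly numbers are all made up. The sketch shows two things. A centered window smears a late spike back into earlier years, and at the right edge the window is truncated, so a single high year counts for more there.

```python
def centered_moving_average(values, window=5):
    """Centered moving average (hypothetical helper, not the tool's method).

    Near either edge the window is truncated and only the
    points that actually exist are averaged.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up yearly topic proportions: flat for nine years, one high final year.
yearly = [0.01] * 9 + [0.05]
smoothed = centered_moving_average(yearly, window=5)

for year_index, (raw, smooth) in enumerate(zip(yearly, smoothed)):
    print(f"year {year_index}: raw={raw:.3f} smoothed={smooth:.4f}")
```

In this toy series the smoothed line starts rising two years before the raw spike, and the last point averages only three years instead of five. A rise that seems to come "too early" on a smoothed chart is therefore not, by itself, evidence against a late-century cause.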
For the sake of argument let’s say that that spike DOES reflect some of the cultural impact of the Klondike. That leaves us with another problem. That wasn’t the only major 19th Century gold rush. California had a gold rush in the middle of the century, before the Civil War. There’s nothing in this graph of the GOLD AND TREASURE topic that answers to the California Gold Rush (1848–1855).
If the Klondike Gold Rush prompted American authors to write about gold, why didn’t the California Gold Rush do the same? I don’t know. Why was gold fever culturally salient at the end of the century, but not in the middle of the century? My mind turned immediately to slavery and the Civil War; perhaps that’s what was sapping people’s attention. But let’s not look into that at this moment. I want to say a bit more about gold.
Gold and Gender
GOLD AND TREASURE isn’t the only gold topic. We’ve also got GOLD AND JEWELS. How does it behave? Here’s the graph depicting the appearance of the topic over time by nation:
The three lines – American (black), British (medium gray), and Irish (light gray) – have somewhat different patterns, and each is a bit spiky, the Irish and American more so than the British. And, while there’s an end-of-century spike in the American line, it’s not nearly so dramatic as that for GOLD AND TREASURE. There’s healthy mid-century action in each of these trends. But the Irish curve is the only one that shows an increase from the late 1840s and into the 1850s, when the California Gold Rush happened.
Now, let’s look at the gender charts for the two topics. GOLD AND TREASURE is first, then GOLD AND JEWELS.
The GOLD AND TREASURE topic tends male while the GOLD AND JEWELS topic tends female. The word clouds for the two topics seem consistent with that bias:
In GOLD AND TREASURE men, mine, mines, miner, pit, shaft, earth, lead, nugget, and so forth show up. For GOLD AND JEWELS it’s diamonds, necklace, rings, bracelet, pearls, ornaments, and so forth. The topics are quite different in content and the content is consistent with the gender preferences.
This doesn’t really add anything to our story about gold, dogs, the Klondike and the end of the century, though it’s interesting in the way that rummaging around in new stuff is interesting, so let’s take a look at mid-century and slavery. Why slavery? That surely was of interest in America and that interest must have been very high at mid-century with the growth of abolition (and Uncle Tom’s Cabin) and the approach of the Civil War. It is not simply that slavery existed at the time, but that it had become controversial and a matter of public concern. It was not merely in the background as one of those things constituting the unspoken way of the world.
Slavery and Awakening
Here’s the word cloud for the AMERICAN SLAVERY topic:
Notice the bits and pieces of pseudo dialect: dat, jes, mus, sah, debbil, and so forth. The following chart, which Jockers uses in the book (Figure 8.12, p. 141), shows that this topic is of much greater interest to American than to British authors; but note the level of interest among Irish authors:
This next graph depicts the yearly behavior of the topic by nation (black: American, medium gray: British, light gray: Irish):
The British curve is pretty uniform through the century, and low. We have a mid-century rise for the American curve (though notice the sharp dip at 1860), as expected, and a remarkable spike in the Irish curve during the Civil War. I’m sure there’s a lot to say about that spike, but not by me. I want to move on.
Here’s the word cloud for a different topic, BIBLICAL LANGUAGE, followed by charts for gender use, national use, and nation over time:
These charts tell us:
- that biblical language is used more by men than women,
- way more by Americans than British or Irish (and more by the British than the Irish), and
- that American use peaked dramatically at about 1840, far more than British or Irish.
That, I warrant, is the effect of the Second Great Awakening, a religious revival that swept through America starting at the end of the 18th Century and was on the way down by the end of the 1840s. The Second Awakening had a significant influence on abolitionism.
The Landscape of Public Attention
Now, you recall that we started looking at topic action in mid-century America because we were curious about the temporal distribution of the GOLD AND TREASURE topic. There were gold rushes at mid-century and at the end of the century, but only the latter seems to have shown up in that distribution. Why?
I was thinking that public attention was attracted to slavery, abolition, and the Civil War, and wanted to look at topic action that could reasonably be attributed to those historical actors. AMERICAN SLAVERY and BIBLICAL LANGUAGE both more or less fit with those historical circumstances, which is not surprising. No doubt other topics show movement that we can attribute to those events. But does that allow us to conclude that it is these events that took the public’s attention away from the California Gold Rush?
I don’t think so. For that conclusion presumes that gold rushes have an automatic claim on public attention such that something with a more powerful claim would have to take place in order to override that intrinsic interest in gold rushes. It’s not at all obvious to me that this is so.
Imagine that, for whatever reason, there had been no Second Awakening, no Uncle Tom’s Cabin, and no rise in abolitionist sentiment. Imagine that slavery was simply taken for granted in America at mid-century. Would the California Gold Rush then have pushed the GOLD AND TREASURE topic to the fore? There’s no way to tell.
What else was going on at the time? And, since we’re now thinking of alternative history, what else might have happened? Who knows? On the other hand, just what is it that seemed to have made the Klondike Gold Rush worthy of public attention? What was going on in the public conversation during the 1890s that primed the public mind for gold rush stories, IF that is indeed what we’re seeing in those two century-end spikes, the ones for male authors in the DOGS topic and the GOLD AND TREASURE topic?
Beware of Artifacts in the Data
I’ve added that last caveat because we might, in fact, just be looking at an artifact of the collection. After I’d been through all this and was thinking things through I read a preprint of Jockers and Mimno, “Significant Themes in 19th-Century Literature,” Poetics 41 (6), 2013, 750-769. That article contains this chart, which depicts the number of texts in the corpus for each year, running between 1750 and 1899:
There aren’t many books in the corpus at the end of the 19th Century. There appear to be about 20 books in the corpus for 1896, the year of the Klondike strike. Since that happened in August it’s difficult to see how it could have had much of an influence on books published in that year. There are fewer than ten books for 1897, and fewer than five for 1898 and 1899. And those bars reflect books for all three literatures, American, British, and Irish, not just American, and by all genders, not just male.
If only one or two American males had published a book or three featuring DOGS and GOLD AND TREASURE, that might have produced those spikes; in which case we have no evidence for public interest in those two topics.
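Here is a rough sketch of why thin years are so fragile. It assumes, purely for illustration, that a yearly value is the plain mean of a topic's proportion across that year's books; I don't know that the tool computes it that way, and the proportions below are invented. The same single dog-heavy book is dropped into a crowded year and into a sparse one.

```python
def yearly_mean(proportions):
    """Mean topic proportion across one year's books (assumed aggregation)."""
    return sum(proportions) / len(proportions)


# Invented numbers: most books barely touch the topic, one book is full of it.
baseline = 0.01
dog_book = 0.30

crowded_year = [baseline] * 99 + [dog_book]  # 100 books, one outlier
sparse_year = [baseline] * 2 + [dog_book]    # 3 books, same outlier

crowded_mean = yearly_mean(crowded_year)
sparse_mean = yearly_mean(sparse_year)

print(f"crowded year mean: {crowded_mean:.4f}")
print(f"sparse year mean:  {sparse_mean:.4f}")
print(f"ratio: {sparse_mean / crowded_mean:.1f}x")
```

With a hundred books the outlier barely moves the yearly value; with three books it dominates it. That is the sense in which a spike in the late 1890s could come from one or two titles rather than from any broad shift in what authors were writing about.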
Did I have all this fun–and this is only part of it–because I got fooled into over-reading an artifact in the data? Quite possibly. Does that possibility disappoint me? No. Why not?
For one thing, I caught the problem myself. No one’s going to expose me on THAT one.
More importantly, I wasn’t doing this because I had a particular idea I was pursuing, an idea I was heavily invested in. I did it because I wanted to play around in a new way. I wanted to see what kind of investigation became possible with the tool Jockers had hung out there on the web.
Those possibilities haven’t disappeared just because one particular hypothesis seems to have dissipated. Right now it’s play that’s the thing.
* * * * *
- Reading Macroanalysis 1: Framing: Hyperobjects, Objectification, and Evolution
- Reading Macroanalysis 2: Metadata and the Emperor’s New Clothes
- Reading Macroanalysis 2.1: How do we make inferences from patterns in collections of books to patterns in populations of readers?
- Reading Macroanalysis 3.0: Style, or the Author Comes Back from the Dead
- Reading Macroanalysis 3.1: Style, or Measuring the Autonomous Aesthetic Realm
- Reading Macroanalysis 4: On the matter of “the”
- Reading Macroanalysis 5: An Interlude on Scale: Micro, Meso, and Macro