Example of the Method of Quantitative Text Analysis

Derek Miller, Harvard University

My research on Broadway history has focused primarily on historical data about productions, people, grosses, etc. Those data types are what I process and analyze in my charts. Much digital humanities work in literary studies, by contrast, uses texts as data. This brief tutorial demonstrates the kind of work necessary to do even rudimentary quantitative analysis on a text. It aims to answer the question, "If you do quantitative text analysis, what, specifically, are you doing?" The answer is that you're using computer programming languages to refine, parse, and analyze your data. The results look something like the code below.

The programming language used here is called R, developed by statisticians. (The first code block below will make R ready for use in this interface, called iPython Notebook, which was designed to combine narrative, code, and code output.) Prof. Matthew Jockers of the University of Nebraska, Lincoln, has written a useful, if somewhat abstract, Introduction to R for Literary Analysis. My example here condenses--and is heavily indebted to--his book.

What follows is a combination of brief explanatory notes, code, and outputs. I do not explain each line of code in detail. The code examples are meant to demonstrate specifically the kind of work entailed in this process.

In [1]:
%load_ext rpy2.ipython

Step 1: Importing and Preparing the Text

The first step in using R for textual analysis involves importing a text into R itself. I will take the example of Shaw's Plays: Pleasant and Unpleasant, with the goal of doing simple word frequency comparisons between the two volumes to test for linguistic differences between the unpleasant (Widowers' Houses, The Philanderer, Mrs Warren's Profession) and the pleasant (Arms and the Man, Candida, The Man of Destiny, You Never Can Tell) plays.

I will presume you begin with a reasonably clean edition of the text you wish to analyze. Any of the Project Gutenberg texts, for example, would suffice. A Word Document or a PDF without Optical Character Recognition (OCR) needs to be converted to plain text first. (In this case, I'm working from Project Gutenberg files of the individual plays, which I reformatted and regularized as HTML (the markup language of the World Wide Web) so as to reproduce Shaw's typographic style more closely. For an example of how such a text looks in a browser, see here. For some of the raw HTML, see here.) Once you have your text, you import it into R.

In [2]:
unpleasant.text <- scan("unpleasant.txt", what="character", sep="\n")
pleasant.text <- scan("pleasant.txt", what="character", sep="\n")
Read 2533 items
Read 3478 items

We now have the complete text of each volume in variables called, respectively, unpleasant.text and pleasant.text. They are currently stored as what are called vectors (basically, an array), in which each line from the file is one element. Because the files are in HTML format and primed to mirror Shaw's careful text styling, each element of the text (dialogue, speaker heading, stage direction, etc.) is marked up to indicate its function (and, by extension, its format). We can use that markup to help us extract only the dialogue.

To do so, we'll first get a list of all the lines in the file that begin with my HTML code for dialogue, which happens to be <p class="dialog">. We'll search for that using regular expressions. (Regular expressions are a powerful text-searching tool that lets you use many different wildcards in a search. In this case, no such wildcards are necessary and our search is simple. But one could use regular expressions to search, say, an 18th-c. novel, seeking all the capitalized words that do not begin a sentence.)

In [3]:
unpleasant.dialogue.numbers <- grep("<p class=\"dialog\">", unpleasant.text)
pleasant.dialogue.numbers <- grep("<p class=\"dialog\">", pleasant.text)
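
To make the parenthetical point about wildcards concrete, here is a minimal sketch of the capitalized-word search mentioned above. The sample sentence and the variable names are invented for illustration, and the pattern is only a rough approximation of "not at the start of a sentence": it keeps capitalized words that directly follow a lowercase letter and a space.

```r
# Invented sample sentence in an 18th-century style (an assumption, not from a real text).
sample.sentence <- "The Captain praised her Virtue and Beauty. She blushed at his Flattery."

# (?<=[a-z] ) is a lookbehind: the match must be preceded by a lowercase letter
# and a space, so words that open a sentence are skipped.
# [A-Z][a-z]+ then matches one capital letter followed by lowercase letters.
capitalized.pattern <- "(?<=[a-z] )[A-Z][a-z]+"

# perl=TRUE is required because lookbehind is a Perl-style feature.
match.positions <- gregexpr(capitalized.pattern, sample.sentence, perl=TRUE)
capitalized.words <- regmatches(sample.sentence, match.positions)[[1]]
print(capitalized.words)
```

A real novel would need a more careful pattern (abbreviations, quotations, and line breaks all complicate what counts as the start of a sentence), but the mechanics are the same.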

Now that we know which lines are dialogue, we'll create new variables that include only those lines. We can check our work by looking at, say, the first three lines of dialogue of the Pleasant plays, which turn out to be from Arms and the Man.

In [4]:
unpleasant.dialogue.with.speakers <- unpleasant.text[unpleasant.dialogue.numbers]
pleasant.dialogue.with.speakers <- pleasant.text[pleasant.dialogue.numbers]
[1] "            <p class=\"dialog\"><a class=\"speaker\">Catherine</a> (<a class=\"stage\">entering hastily, full of good news</a></a>). Raina---(<a class=\"stage\">she pronounces it Rah-eena, with the stress on the ee</a></a>) Raina---(<a class=\"stage\">she goes to the bed, expecting to find Raina there.</a>) Why, where---(<a class=\"stage\">Raina looks into the room.</a>) Heavens! child, are you out in the night air instead of in your bed? You'll catch your death. Louka told me you were asleep.</p>"
[2] "            <p class=\"dialog\"><a class=\"speaker\">Raina</a> (<a class=\"stage\">coming in</a>). I sent her away. I wanted to be alone. The stars are so beautiful! What is the matter?</p>"                                                                                                                                                                                                                                                                                                                         
[3] "            <p class=\"dialog\"><a class=\"speaker\">Catherine</a>. Such news. There has been a battle!</p>"                                                                                                                                                                                                                                                                                                                                                                                                           
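
The two-step pattern used above (find the line numbers, then subset by them) can be tried on a toy vector. This is a sketch with invented lines rather than the Shaw files; the `toy.` names are mine.

```r
# Invented stand-in for a few lines of a marked-up play (not from the Shaw files).
toy.text <- c("<h2>ACT I</h2>",
              "<p class=\"dialog\">Such news.</p>",
              "<p class=\"stage\">She exits.</p>",
              "<p class=\"dialog\">What is the matter?</p>")

# grep() returns the positions of the matching elements, not the elements themselves.
toy.dialogue.numbers <- grep("<p class=\"dialog\">", toy.text)
print(toy.dialogue.numbers)

# Using those positions inside square brackets pulls out just the matching lines.
toy.dialogue <- toy.text[toy.dialogue.numbers]

# value=TRUE is a shortcut that returns the matching lines directly.
toy.dialogue.direct <- grep("<p class=\"dialog\">", toy.text, value=TRUE)
print(identical(toy.dialogue, toy.dialogue.direct))
```

Keeping the line numbers in their own variable, as the tutorial does, is useful when you later want to know where in the file a given line of dialogue came from.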

Step 2: Organizing Text

As you can see, there's still a lot of HTML in our text. We want to strip all that material so we end up only with the words spoken by Shaw's characters. We'll do that with more regular expressions, this time using find/replace and wildcard characters where the speaker names go, since we don't care who says which lines right now.

In [5]:
unpleasant.dialogue.with.speakers <- sub("<p class=\"dialog\">", "", unpleasant.dialogue.with.speakers)
pleasant.dialogue.with.speakers <- sub("<p class=\"dialog\">", "", pleasant.dialogue.with.speakers)
unpleasant.dialogue <- sub("<a class=\"speaker\">.*?</a>", "", unpleasant.dialogue.with.speakers)
pleasant.dialogue <- sub("<a class=\"speaker\">.*?</a>", "", pleasant.dialogue.with.speakers)
[1] "             (<a class=\"stage\">entering hastily, full of good news</a></a>). Raina---(<a class=\"stage\">she pronounces it Rah-eena, with the stress on the ee</a></a>) Raina---(<a class=\"stage\">she goes to the bed, expecting to find Raina there.</a>) Why, where---(<a class=\"stage\">Raina looks into the room.</a>) Heavens! child, are you out in the night air instead of in your bed? You'll catch your death. Louka told me you were asleep.</p>"
[2] "             (<a class=\"stage\">coming in</a>). I sent her away. I wanted to be alone. The stars are so beautiful! What is the matter?</p>"                                                                                                                                                                                                                                                                                                                     
[3] "            . Such news. There has been a battle!</p>"                                                                                                                                                                                                                                                                                                                                                                                                           
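
The question mark in `.*?` matters here. The sketch below, on one line adapted from the output above, compares the "lazy" wildcard with the "greedy" `.*`. I pass perl=TRUE in this illustration so that both patterns follow Perl-style rules; that choice, and the variable names, are mine.

```r
# One line of dialogue with a speaker heading and a stage direction.
toy.line <- "<a class=\"speaker\">Raina</a> (<a class=\"stage\">coming in</a>). I sent her away."

# Lazy: .*? stops at the FIRST </a>, so only the speaker heading is removed.
lazy.result <- sub("<a class=\"speaker\">.*?</a>", "", toy.line, perl=TRUE)

# Greedy: .* runs to the LAST </a>, so the stage direction is swallowed as well.
greedy.result <- sub("<a class=\"speaker\">.*</a>", "", toy.line, perl=TRUE)

print(lazy.result)
print(greedy.result)
```

Note also that sub() replaces only the first match in each line, which is what we want for a speaker heading that appears once per line; gsub() replaces every match.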

We still have to remove stage directions, which we can do, again, with regular expressions. Rather than discarding them, though, let's save them to their own variables.

In [6]:
stage.directions.pattern <- "<a class=\"stage\">.*?</a>"
locate.unpleasant.stage.directions <- gregexpr(stage.directions.pattern, unpleasant.dialogue)
locate.pleasant.stage.directions <- gregexpr(stage.directions.pattern, pleasant.dialogue)
unpleasant.stage.directions <- regmatches(unpleasant.dialogue, locate.unpleasant.stage.directions)
pleasant.stage.directions <- regmatches(pleasant.dialogue, locate.pleasant.stage.directions)
[1] "<a class=\"stage\">entering hastily, full of good news</a>"                  
[2] "<a class=\"stage\">she pronounces it Rah-eena, with the stress on the ee</a>"
[3] "<a class=\"stage\">she goes to the bed, expecting to find Raina there.</a>"  
[4] "<a class=\"stage\">Raina looks into the room.</a>"                           

Our result shows us that we've correctly captured the multiple stage directions in Catherine's first line. Now, let's strip all those stage directions from our dialogue.

In [7]:
unpleasant.dialogue <- gsub(stage.directions.pattern, "", unpleasant.dialogue)
pleasant.dialogue <- gsub(stage.directions.pattern, "", pleasant.dialogue)
[1] "             (</a>). Raina---(</a>) Raina---() Why, where---() Heavens! child, are you out in the night air instead of in your bed? You'll catch your death. Louka told me you were asleep.</p>"
[2] "             (). I sent her away. I wanted to be alone. The stars are so beautiful! What is the matter?</p>"                                                                                    
[3] "            . Such news. There has been a battle!</p>"                                                                                                                                          

Success! Now let's clean up our dialogue by stripping any lurking traces of HTML or other odd formatting before moving on to punctuation and other issues.

In [8]:
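# First pass: remove leftover closing tags, italics tags, and empty parentheses.
unpleasant.dialogue <- gsub("(</a>|</p>|<i>|</i>|\\(\\))", "", unpleasant.dialogue)
pleasant.dialogue <- gsub("(</a>|</p>|<i>|</i>|\\(\\))", "", pleasant.dialogue)
# Second pass: removing "</a>" from "(</a>)" leaves a new "()" behind, so run the same cleanup again.
unpleasant.dialogue <- gsub("(</a>|</p>|<i>|</i>|\\(\\))", "", unpleasant.dialogue)
pleasant.dialogue <- gsub("(</a>|</p>|<i>|</i>|\\(\\))", "", pleasant.dialogue)
# Drop the leading whitespace and period left behind where the speaker heading used to be.
unpleasant.dialogue <- gsub("^\\s*\\.\\s", "", unpleasant.dialogue, perl=TRUE)
pleasant.dialogue <- gsub("^\\s*\\.\\s", "", pleasant.dialogue, perl=TRUE)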
[1] "Raina--- Raina--- Why, where--- Heavens! child, are you out in the night air instead of in your bed? You'll catch your death. Louka told me you were asleep."
[2] "I sent her away. I wanted to be alone. The stars are so beautiful! What is the matter?"                                                                      
[3] "Such news. There has been a battle!"                                                                                                                         

Much better!
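
The cleanup pattern is applied twice in the cell above, and a toy string shows why a single pass is not enough. This is a sketch with an invented fragment; the `toy.` names are mine.

```r
cleanup.pattern <- "(</a>|</p>|<i>|</i>|\\(\\))"

# Invented fragment containing both "(</a>)" and "()".
toy.fragment <- "(</a>). Raina---() Why"

# One pass removes "</a>" and the existing "()", but deleting "</a>" from
# "(</a>)" creates a fresh "()" that this pass has already moved beyond.
after.one.pass <- gsub(cleanup.pattern, "", toy.fragment)

# A second pass catches the newly created "()".
after.two.passes <- gsub(cleanup.pattern, "", after.one.pass)

print(after.one.pass)
print(after.two.passes)
```

The general lesson is that each replacement can create new matches, so it is worth printing a few lines after every cleaning step.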

Step 3: Regularizing Text

You may have noticed we have punctuation to worry about, as well as capitalized words. We can handle both by collapsing our vector of lines of dialogue into a single blob of text, then converting that text to lower case and splitting it into a vector of individual words. Finally, we will remove all items that are not words. Again, regular expressions are useful here, this time searching for breaks between words.

In [9]:
unpleasant.blob <- paste(unpleasant.dialogue, collapse=" ")
pleasant.blob <- paste(pleasant.dialogue, collapse=" ")
unpleasant.lower <- tolower(unpleasant.blob)
pleasant.lower <- tolower(pleasant.blob)
unpleasant.words <- unlist(strsplit(unpleasant.lower, "[^'[:alpha:]]"))
pleasant.words <- unlist(strsplit(pleasant.lower, "[^'[:alpha:]]"))
 [1] "raina"   ""        ""        ""        "raina"   ""        ""       
 [8] ""        "why"     ""        "where"   ""        ""        ""       
[15] "heavens" ""        "child"   ""        "are"     "you"     "out"    
[22] "in"      "the"     "night"   "air"     "instead" "of"      "in"     
[29] "your"    "bed"     ""        "you'll"  "catch"   "your"    "death"  
[36] ""        "louka"   "told"    "me"      "you"     "were"    "asleep" 
[43] ""        "i"       "sent"    "her"     "away"    ""        "i"      
[50] "wanted" 

Next, we will remove all the empty places where punctuation used to be.

In [10]:
unpleasant.words <- unpleasant.words[which(unpleasant.words!="")]
pleasant.words <- pleasant.words[which(pleasant.words!="")]
 [1] "raina"   "raina"   "why"     "where"   "heavens" "child"   "are"    
 [8] "you"     "out"     "in"      "the"     "night"   "air"     "instead"
[15] "of"      "in"      "your"    "bed"     "you'll"  "catch"   "your"   
[22] "death"   "louka"   "told"    "me"      "you"     "were"    "asleep" 
[29] "i"       "sent"    "her"     "away"    "i"       "wanted"  "to"     
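
The empty strings come from the way the split works: every single character that is neither a letter nor an apostrophe counts as a break, so a run of punctuation and spaces produces empty items between the breaks. A minimal sketch on an invented line (the `toy.` names are mine):

```r
# Invented line modeled on Catherine's first speech.
toy.line <- "Raina--- Why, you'll catch"

# Split on any single character that is not a letter or an apostrophe.
toy.pieces <- unlist(strsplit(tolower(toy.line), "[^'[:alpha:]]"))
print(toy.pieces)

# Keep only the non-empty items.
toy.words <- toy.pieces[toy.pieces != ""]
print(toy.words)
```

Keeping the apostrophe inside the character class is what preserves contractions such as "you'll" as single words.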

Finally, we will clean up each list by eliminating "stop words". These are high-frequency words that appear in every text. While the frequency of their presence may be useful for certain kinds of analysis (for instance, author attribution), we will ignore them in this example.

In [11]:
stop.words <- scan("stoplist.txt", what="character", sep="\n")
unpleasant.words <- unpleasant.words[!unpleasant.words %in% stop.words]
pleasant.words <- pleasant.words[!pleasant.words %in% stop.words]
Read 610 items
 [1] "raina"     "raina"     "heavens"   "child"     "night"     "air"      
 [7] "bed"       "catch"     "death"     "louka"     "told"      "asleep"   
[13] "wanted"    "stars"     "beautiful" "matter"    "news"      "battle"   
[19] "ah"        "great"     "battle"    "slivnitza" "victory"   "won"      
[25] "sergius"  
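
The filtering step relies on R's %in% operator, which tests each word for membership in another vector. Here is a sketch with a tiny invented stop list; the real stoplist.txt has 610 entries, and the `toy.` names are mine.

```r
# A handful of words from Raina's first line, plus a tiny invented stop list.
toy.words <- c("i", "sent", "her", "away", "the", "stars")
toy.stop.words <- c("i", "her", "the")

# %in% returns TRUE or FALSE for each word; ! flips it so we keep the non-stop words.
is.stop.word <- toy.words %in% toy.stop.words
toy.kept <- toy.words[!is.stop.word]
print(toy.kept)
```

Which words survive depends entirely on the stop list chosen, so results from different lists are not directly comparable.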

We likely still have a lot of character names in our text. Let's write some code to get a list of our names from our speaker headings, and remove those from our word lists.

In [12]:
speaker.pattern <- "<a class=\"speaker\">.*?</a>"
locate.unpleasant.speakers <- gregexpr(speaker.pattern, unpleasant.dialogue.with.speakers)
locate.pleasant.speakers <- gregexpr(speaker.pattern, pleasant.dialogue.with.speakers)
unpleasant.speakers <- regmatches(unpleasant.dialogue.with.speakers, locate.unpleasant.speakers)
pleasant.speakers <- regmatches(pleasant.dialogue.with.speakers, locate.pleasant.speakers)
unpleasant.speakers <- unlist(unpleasant.speakers)
pleasant.speakers <- unlist(pleasant.speakers)
unpleasant.speakers<- gsub("<a class=\"speaker\">","",unpleasant.speakers)
unpleasant.speakers<- gsub("</a>","",unpleasant.speakers)
unpleasant.speakers <- unique(unpleasant.speakers)
pleasant.speakers<- gsub("<a class=\"speaker\">","",pleasant.speakers)
pleasant.speakers<- gsub("</a>","",pleasant.speakers)
pleasant.speakers <- unique(pleasant.speakers)
speakers.to.remove.pattern <- "(The|A|Rev\\. S|&)"
remove.unpleasant.speakers <- grep(speakers.to.remove.pattern, unpleasant.speakers)
remove.pleasant.speakers <- grep(speakers.to.remove.pattern, pleasant.speakers)
unpleasant.speakers <- tolower(unpleasant.speakers[-remove.unpleasant.speakers])
pleasant.speakers <- tolower(pleasant.speakers[-remove.pleasant.speakers])
unpleasant.speakers <- gsub("mrs ","",unpleasant.speakers)
unpleasant.words <- unpleasant.words[!unpleasant.words %in% unpleasant.speakers]
pleasant.words <- pleasant.words[!pleasant.words %in% pleasant.speakers]
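
The same steps can be sketched more compactly on toy data. Everything here is illustrative: the lines are invented, the anchored pattern "^(The|A) " is a simplified stand-in for the removal pattern above, and the filter uses grepl() rather than a negative index. That last choice is mine; it sidesteps a quirk of R in which a negative index built from an empty grep() result returns nothing at all.

```r
# Invented lines with speaker headings (not from the Shaw files).
toy.lines <- c("<a class=\"speaker\">Raina</a>. I sent her away.",
               "<a class=\"speaker\">The Man</a>. Hush!",
               "<a class=\"speaker\">Raina</a>. Who is there?",
               "<a class=\"speaker\">Catherine</a>. Such news.")

# Extract the headings, strip the tags, and keep each name once.
speaker.tags <- unlist(regmatches(toy.lines,
                                  gregexpr("<a class=\"speaker\">.*?</a>", toy.lines, perl=TRUE)))
toy.speakers <- unique(gsub("<[^>]+>", "", speaker.tags))

# Drop generic headings such as "The Man" with a logical filter.
toy.names <- tolower(toy.speakers[!grepl("^(The|A) ", toy.speakers)])
print(toy.names)

# The quirk: if grep() finds nothing, the negative index selects zero elements.
no.match <- grep("Zzz", toy.speakers)
print(length(toy.speakers[-no.match]))

# Remove the names from a toy word list.
toy.words <- c("raina", "heavens", "child", "catherine", "news")
toy.words.without.names <- toy.words[!toy.words %in% toy.names]
print(toy.words.without.names)
```

In the tutorial's own data every removal pattern does match something, so the negative index works there; the logical filter simply stays safe when a volume has no such headings.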

Step 4: Rudimentary Analysis

Now we'll see which words are most frequent in each text by making a relative frequency table for each and sorting it.

In [13]:
unpleasant.frequency <- table(unpleasant.words)
pleasant.frequency <- table(pleasant.words)
unpleasant.relative.frequency <- sort(100*(unpleasant.frequency/sum(unpleasant.frequency)), decreasing=TRUE)
pleasant.relative.frequency <- sort(100*(pleasant.frequency/sum(pleasant.frequency)), decreasing=TRUE)
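
As a sanity check on the arithmetic, here is the same calculation on six invented words. A relative frequency here is a word's count divided by the total number of words, times 100, so the values are percentages of the (filtered) word list. The `toy.` names are mine.

```r
# Six invented words: "money" three times, "love" twice, "business" once.
toy.words <- c("money", "love", "money", "business", "money", "love")

# table() counts each distinct word.
toy.frequency <- table(toy.words)

# Divide by the total and multiply by 100 to get percentages, then sort high to low.
toy.relative.frequency <- sort(100 * (toy.frequency / sum(toy.frequency)), decreasing=TRUE)
print(toy.relative.frequency)

# The names of the sorted table are the words in rank order.
print(names(toy.relative.frequency))
```

Because the percentages are computed after stop words and character names are removed, they describe each word's share of the remaining vocabulary, not of the full text.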

Let's look at the 25 most frequently used words in the Unpleasant plays.

In [14]:
names(unpleasant.relative.frequency[1:25])
 [1] "good"     "sir"      "man"      "woman"    "miss"     "mother"  
 [7] "make"     "money"    "people"   "mind"     "business" "suppose" 
[13] "life"     "matter"   "father"   "harry"    "love"     "made"    
[19] "hope"     "put"      "time"     "back"     "daughter" "young"   
[25] "day"     

And the Pleasant plays.

In [15]:
names(pleasant.relative.frequency[1:25])
 [1] "sir"        "good"       "love"       "miss"       "father"    
 [6] "james"      "make"       "young"      "clandon"    "give"      
[11] "mind"       "people"     "general"    "mother"     "woman"     
[16] "back"       "ah"         "understand" "things"     "eugene"    
[21] "put"        "thought"    "children"   "suppose"    "gentleman" 

Let's see the list of unique words in the top 25 of both texts.

In [16]:
unique(c(names(unpleasant.relative.frequency[1:25]), names(pleasant.relative.frequency[1:25])))
 [1] "good"       "sir"        "man"        "woman"      "miss"      
 [6] "mother"     "make"       "money"      "people"     "mind"      
[11] "business"   "suppose"    "life"       "matter"     "father"    
[16] "harry"      "love"       "made"       "hope"       "put"       
[21] "time"       "back"       "daughter"   "young"      "day"       
[26] "james"      "clandon"    "give"       "general"    "ah"        
[31] "understand" "things"     "eugene"     "thought"    "children"  
[36] "gentleman" 

And just the words both volumes share in common.

In [17]:
names(unpleasant.relative.frequency[which(names(unpleasant.relative.frequency[1:25]) %in% names(pleasant.relative.frequency[1:25]))])
 [1] "good"    "sir"     "woman"   "miss"    "mother"  "make"    "people" 
 [8] "mind"    "suppose" "father"  "love"    "put"     "back"    "young"  

And the words unique to the Unpleasant plays list, then the Pleasant plays list.

In [18]:
words.present.in.both <- which(names(unpleasant.relative.frequency[1:25]) %in% names(pleasant.relative.frequency[1:25]))
names(unpleasant.relative.frequency[1:25])[-words.present.in.both]
 [1] "man"      "money"    "business" "life"     "matter"   "harry"   
 [7] "made"     "hope"     "time"     "daughter" "day"     

In [19]:
words.present.in.both <- which(names(pleasant.relative.frequency[1:25]) %in% names(unpleasant.relative.frequency[1:25]))
names(pleasant.relative.frequency[1:25])[-words.present.in.both]
 [1] "james"      "clandon"    "give"       "general"    "ah"        
 [6] "understand" "things"     "eugene"     "thought"    "children"  
[11] "gentleman" 
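
The comparisons above can also be written with R's built-in set functions, which some readers may find easier to follow than which() and %in%. The sketch below uses only the top five words of each list as printed above; the variable names are mine.

```r
# Top five words from each list, copied from the output above.
top.unpleasant <- c("good", "sir", "man", "woman", "miss")
top.pleasant <- c("sir", "good", "love", "miss", "father")

# Words in either list (equivalent to unique(c(...))).
print(union(top.unpleasant, top.pleasant))

# Words in both lists.
shared.words <- intersect(top.unpleasant, top.pleasant)
print(shared.words)

# Words only in the Unpleasant list, then only in the Pleasant list.
only.unpleasant <- setdiff(top.unpleasant, top.pleasant)
only.pleasant <- setdiff(top.pleasant, top.unpleasant)
print(only.unpleasant)
print(only.pleasant)
```

Each function keeps the order of its first argument, so the results still read from most to least frequent.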

We might begin to draw some basic conclusions from these lists. The conclusions, because based on such simple analysis, are not surprising, but they tell us our analysis is moving in the right direction. For example, in the shared words list, we find a vocabulary of address: "sir", "miss". That "sir" should be in the top two of both volumes underscores how profoundly Shaw's dialogue forms around not just conversation, but polite conversation. His philosophical dialogues rely on a veneer of gentility, even among characters who disavow such courtesies. We might see traces, too, of Shaw's fundamentally comic form: the word "love" amidst such words as "young", "mother", and "father" suggests the structure of every comic marriage plot. The prominence of "woman", "miss", and "mother" in both volumes (but "man" only in the Unpleasant plays) underlines how much these plays are about men coming to terms with a new kind of woman (Vivie Warren, Candida).

Distinctions between the two volumes also affirm the validity of our method. Unsurprisingly, the Unpleasant plays are about "money" and "business", both operating under the pressure of "time". The Pleasant plays offer a less revealing list. A number of lingering character names occupy important space. If anything, the list makes clear how little the Pleasant plays share in common: Arms and the Man's anti-militarism, Candida's theological and philosophical treatment of marriage, The Man of Destiny's biographical exploration of Napoleon, and You Never Can Tell's Wildean comedy really don't belong in a volume together, except chronologically. Shaw's Pleasant plays do not share a common pleasurable vocabulary, particularly compared to the displeasing language of business that unites the Unpleasant plays.

These are rudimentary experiments on a small corpus, so naturally the analyses are extremely preliminary as well as primitive. Other versions of this work might take in a larger Shavian corpus, compare stage directions, parse vocabulary by speaker, focus on hapax legomena, or contrast Shaw's language with that of a contemporary such as Wilde, Pinero, or, perhaps most enticingly, the William Archer-edited editions of Henrik Ibsen's plays. The analytic techniques Jockers uses in his volume Macroanalysis, for example, are more advanced than those I've shown here.
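
As one small illustration of such an extension, hapax legomena (words that occur exactly once) fall out of the same frequency table used above. This is only a sketch on invented words, not a result from the Shaw corpus; the `toy.` names are mine.

```r
# Invented word list: "battle" and "stars" each occur once.
toy.words <- c("money", "love", "money", "battle", "love", "stars")

# Count each distinct word, then keep the names of those counted exactly once.
toy.frequency <- table(toy.words)
toy.hapax <- names(toy.frequency)[as.vector(toy.frequency) == 1]
print(toy.hapax)
```

On a real text one would want to decide first whether stop words and character names should be removed before counting.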

I hope to have given you a sense, however, of the actual tasks involved in conducting this analysis. There is nothing sophisticated about what I have done above. One must, however, learn some of the programming language R to begin this analysis, and then learn how to apply and adapt it (and its advanced tools) to a specific case. On the one hand, such analysis is a mechanical activity: one must find the proper statistical recipe and cook up the text to produce a result. On the other hand, each text presents its own tricks and problems. For instance, Jockers' sample text is Moby Dick, which he treated as a single corpus and, in later lessons, divided by chapter. I needed to separate stage directions and speaker headings from my dramatic texts. To do that, I had to regularize Shaw's text (in this case, into HTML), then create regular expression searches that parsed that marked-up text into its component parts. Quantitative text analysis always requires some invention and experimentation in the code. In practice, it looks something like the short example above.