Keystone DH, Session 14: thoughts on methodology &c.

First presentation: “The Napoleonic Theater Corpus: towards a representative corpus of nineteenth-century French” by Angus B. Grieve-Smith, a linguist. His research looked at 19th-century plays as a way of studying how French negation had changed over time. He randomly selected four plays, one from each quarter of the century, from two different sources. The first was the FRANTEXT corpus, a collection of 2900+ French texts that literary scholars viewed as significant, compiled in the 60s on punch cards and paper tape, then transferred to magnetic media in the 90s. The second source was a list of all French plays published in the 19th century, compiled by Beaumont Wicks, who spent about 30 years on the project (like a much more involved version of our 1760s project?).

This study let Grieve-Smith see how “ne pas” was becoming an increasingly popular form of negation in French at this time, but it also provided an argument for the importance of looking at less canonical texts when making broader arguments about the literature or language of a particular period. “Ne pas” made up 87% of negations in the plays randomly sampled from the list of all plays published during the century, but only 49% of those in the plays chosen as significant by scholars. Grieve-Smith suggested that this could be because the plays now seen as canonical disproportionately deal with nobility who speak a more formal, traditional version of French, while the full list of plays includes more about ordinary people whose speech is more likely to reflect how the average person would have spoken at the time. This seemed in line with Moretti’s argument about the importance of distant reading, and how we may see particular texts as worthy of study specifically because they’re exceptional, which then makes it difficult to get a sense of what an “average” text from the period would have looked like.


Second presentation: “On Creating the Digital Joyce Word Dictionary” by Natasha Chenier (available online). Chenier started out by talking about how frequently James Joyce is cited in different editions of the Oxford English Dictionary. The first edition was completed in 1928 and tended to draw its citations from canonical authors, but Joyce wasn’t yet famous enough to be included. In the next two editions, however, he became the most commonly cited modernist writer, with 1800 citations in the second edition (many of them for obscenities, which were added for the first time in this edition) and 2400 in the third. The vast majority of these citations were from Ulysses, suggesting that the OED’s focus was on “great works” rather than “great writers”.

Chenier then focused on the inclusion (and exclusion) of Joyce’s neologisms in the OED. Again, the majority of these (74/89 in one edition) came from Ulysses; she suggested that while Finnegans Wake includes more neologisms, their meaning is usually so unclear that giving an accurate definition would be difficult. Chenier also noted that the OED’s choice of which neologisms to include seemed somewhat arbitrary, which was why she was beginning to create an online dictionary of Joyce’s neologisms.

I very much enjoyed hearing the way Chenier talked about the principles behind this project. She talked about wanting to make the site aesthetically appealing, and to present it in a way that made it look inviting to a non-academic audience. She also wanted the project to align with Joyce’s “or, more accurately Leopold Bloom’s politics” (referring to the protagonist of Ulysses; I understood this to mean something like openness/flexibility/acceptance of multiple interpretations and contradictory information). In practice, this meant that anyone could submit a definition to the dictionary, and the site would allow multiple definitions or interpretations to be displayed alongside each other rather than showing only one authoritative version. This also seemed like a good way around issues like the ones the OED might have faced with neologisms from Finnegans Wake, where there were so many possible interpretations of each word’s meaning that it was impossible to arrive at an authoritative one.


Third presentation: “Literary Periodization and the (D)evolution of Distinctive Gender Markers,” based on research done by Sean G. Weidman and James O’Sullivan, and presented by Weidman. According to Weidman, this study was meant to build on a paper by David Hoover titled “Textual Analysis.” Hoover compared 13 male and female contemporary poets and apparently used this to make claims about gendered differences in word use. Pulling a quote from the paper: “Relatively common words like mother are found in twenty women’s sections but only eleven men’s […] Female markers like children and mirrors and male markers like beer and lust seem almost stereotypical, but there are also surprises, like the female marker fist and the male markers song and dancing.”

Weidman said that he and O’Sullivan were skeptical of the conclusions Hoover drew, wondering whether they fit too closely with gender stereotypes, and set out to do a similar study with a larger sample size. They looked at prose instead of poetry and covered Victorian, modernist, and contemporary literature, using texts from 9 male and 9 female authors from each period. However, their findings so far have been similar to Hoover’s. Some of the findings they highlighted were that women use more “emotive” language, personal pronouns, and “relational” words like “wife” or “brother”, while men’s language becomes increasingly sexual and colloquial moving into the contemporary period, and male modernists in particular have a tendency to use quantitative language. Male and female gender markers were most similar in the Victorian period; in the modernist period, men’s texts stayed fairly similar to each other while women’s had a huge number of outliers; and the contemporary period had the most clearly defined gender markers, with both men’s and women’s texts clustered tightly together.

Weidman presented this study with a lot of disclaimers: for example, that it was only a preliminary study; that it relies on static or traditional ideas of canonicity and periodization when these categories are actually contested; that the appearance of a word like “mother” in a text doesn’t tell us the relationship the author has to that word or concept; and (paraphrasing this point, possibly inaccurately?) that Zeta, the program it uses, is set up to look for differences between two discrete groups of texts, which means that what stands out in this study will inevitably be difference, not similarity or overlap between categories. Audience members pointed out a number of other potential issues during the Q&A: that the study treats “male” and “female” as clear, fixed categories even though they might look different or need to be applied differently for (e.g.) trans authors; that it needs to account more for historical limitations on what female authors could write about; and that this type of research doesn’t necessarily control for situations like female authors writing in the voice of male characters, and what effect this might or might not have on the vocabulary they use.
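
To make the point about Zeta concrete for myself, here is a minimal Python sketch of one common formulation of the measure (Craig's variant of Burrows's Zeta, as I understand it): for each word, add the share of segments in one group that contain it to the share of segments in the other group that lack it. This is an assumption about the general technique, not a reconstruction of the software or settings Weidman and O'Sullivan actually used. The function name `zeta_scores` and the tiny word lists are invented for illustration (the words echo Hoover's quote, but the data is not his).

```python
def zeta_scores(group_a, group_b):
    """Score each word by how strongly it marks group_a over group_b.

    group_a, group_b: lists of segments, each segment a list of word tokens.
    Score = (share of A segments containing the word)
          + (share of B segments NOT containing the word).
    Scores range from 0 (pure B marker) to 2 (pure A marker).
    """
    sets_a = [set(segment) for segment in group_a]
    sets_b = [set(segment) for segment in group_b]
    vocabulary = set().union(*sets_a, *sets_b)

    scores = {}
    for word in vocabulary:
        share_a = sum(word in s for s in sets_a) / len(sets_a)
        share_b = sum(word in s for s in sets_b) / len(sets_b)
        scores[word] = share_a + (1 - share_b)
    return scores


# Invented toy segments, NOT data from either study.
women_segments = [
    ["the", "mother", "children", "fist"],
    ["the", "mother", "mirrors"],
    ["the", "children", "song"],
]
men_segments = [
    ["the", "beer", "lust", "song"],
    ["the", "song", "dancing"],
    ["the", "beer", "mother"],
]

scores = zeta_scores(women_segments, men_segments)

# Rank from strongest "women's marker" to strongest "men's marker".
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for word, score in ranked:
    print(f"{word:10s} {score:.2f}")
```

In this toy example, a word that appears in every segment of both groups ("the") lands exactly in the middle of the ranking, and only the two ends of the list get reported as markers. That seems to be the sense in which the method foregrounds difference: shared vocabulary is scored, but it never surfaces.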

I appreciated hearing both the presenter’s and the audience’s reflections on the assumptions embedded in and potential limitations of these types of studies, even if they can be a valuable way of testing common assumptions about gender and writing. I’d also like to add that I’m frustrated by how often conversations about this issue end up as an attempt to defend women writers against the claim that they disproportionately focus on (e.g.) emotions, domestic spaces, or family or romantic relationships, either by rejecting it entirely or by explaining what historical and social forces might have made it difficult for women to write on other topics. I very much support that type of historical inquiry and think it’s necessary as a corrective to the kind of thinking that wants to claim women are innately more relationship-focused than men (which everyone in the room seemed committed to avoiding), but I also worry that sometimes it ends up being framed in a way that implicitly accepts the idea that emotions/relationships/domestic space/etc. are less valuable topics than whatever men are supposedly writing about instead (in Hoover’s study, apparently lust and beer). Since this is a study of prose fiction specifically, why not reframe women writers’ focus on relationships, emotion, and domesticity to see it as indicative of their central role in the development of the novel, which is often defined by its ability to address family and romantic relationships and characters’ interiority in increasing detail? Why don’t we see studies like this and immediately rush to defend men against the charge that they can only write shallow texts about numbers, beer, and sex?

And what if we read this study with the assumption that its findings will refute gender stereotypes? If I understood the more technical part of the presentation correctly, Weidman and O’Sullivan found a much higher degree of variance in the language used by modernist women than the language used by modernist men; might this be used to argue that, contrary to the more common assumption, modernist women were more formally innovative than their male counterparts? The finding that women used more verbs than men on average also seems to challenge the type of discourse that I’ve seen in a lot of (questionable) advice for fiction writers that treats adjectives and adverbs as ornamental, flowery, or feminine, and verbs as action-oriented and masculine.

The question I take away from this is: one of the major advantages of DH work appears to be the possibility of engaging in more “scientific” studies than humanities work typically allows, in theory allowing us to challenge “common-sense” assumptions that may in fact be incorrect. But what happens when the findings of these studies (or the methods that produced those findings) themselves require subjective, and thus potentially biased, interpretation?