PhD in English from Temple University. Victorian studies, detective fiction, information culture.

#KeyDH Session 4: Supercomputers, Ancient Handwriting, Politics of Text Mining

2 min read

Rick Costa discussed the Bridges project from Carnegie Mellon, which offers supercomputing access to non-traditional users like digital humanists. It provides a range of portals and tools for things like analyzing film images, data management, and textual analysis, and can be used in conjunction with coding languages like Python. To use their supercomputing power and support structure, you need to apply for a grant from the National Science Foundation. Our data for END isn't extensive enough to require something like this, but it's interesting to consider with regard to where DH might go in the future. I wonder if we're heading toward this model, where we'll see less development of individual tools at different universities and more investment in larger, comprehensive projects.

Pablo Alvarez spoke about a tool for teaching ancient Greek handwriting. It seems to be a program which helps students identify and transcribe letters on Greek manuscripts. Some of our team saw how poorly OCR works even on printed texts--imagine how necessary it is to have people transcribing ancient Greek manuscripts by hand! Alvarez focused a great deal on the educational possibilities of such a program. It makes me wonder: how much do we learn when we do what's sometimes dismissed as the "busy work" of data analysis--transcribing, making decisions about what goes in which data field, and so on?

The third talk, from Justin Joque, was on the "Politics of Text Mining." Joque challenged the audience to consider DH work alongside government surveillance, specifically the data-mining of the NSA. Joque framed his argument in terms of the relation of the part to the whole--what connections do we draw between an individual text and the data we get from topic modeling a corpus? Between an individual person and the large-scale demographic information which governments and security agencies use to assess threats? This led to some provocative discussion. This related post by Michael Widner gives an overview of some of the conversation about this issue, which I found helpful.

Recapping R Learning - July 17

1 min read

We had an exciting workshop today with Roberto Vargas from Swarthmore libraries. We jumped briefly ahead to chapter 10 and discussed using R to parse XML. Since our END records are all being coded in XML, this was particularly helpful and relevant (thank you, Roberto!).

We loaded the XML package, discussed XPath, and worked through the Jockers examples with Moby Dick. Roberto also showed us examples of how we could use similar code on our own XML records, drawing a helpful connection between our project and the Jockers book. He's going to send us the code he used, complete with comments on what the different functions mean.
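This isn't Roberto's code, but a minimal sketch of the general approach, using the XML package and an XPath expression on a tiny invented two-record snippet (the element names here are hypothetical, not our actual END schema):

```r
library(XML)

# A tiny invented snippet standing in for a file of XML records
records <- '<collection>
  <record><title>Evelina</title><year>1778</year></record>
  <record><title>Pamela</title><year>1740</year></record>
</collection>'

doc <- xmlParse(records, asText = TRUE)

# XPath addresses nodes by path: //record/title matches every <title>
# inside a <record>, anywhere in the document
titles <- xpathSApply(doc, "//record/title", xmlValue)
titles
# [1] "Evelina" "Pamela"
```

The same pattern works on a file by passing a filename to xmlParse() instead of a string.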

We also discussed the sink() function as a way to record outputs--it's definitely helpful to be able to pull results easily into a text file. I personally found it helpful to distinguish between the arbitrary, user-defined tags of XML and the fixed tag set of HTML. And we spent some time understanding the concept of "nodes" (our individual MARC records) and the ways in which a document can be split into pieces and analyzed based on those pieces.

Recapping R Learning - June 26

1 min read

We kept working through Matt Jockers's book at our workshop. Most of us are working through chapter 4, which has a lot to absorb. A few of us have reached the end of chapter 5, meaning that we've mostly completed the "Microanalysis" portion of the book.

Nabil Kashyap from Swarthmore Library visited and gave us tips, including making full use of the auto-completion options in R Studio. We talked through some syntax questions we've been working out. We also discussed what we'd like to ultimately be able to do in R Studio--how will it help us with our own projects? What additional functionality does it give us?

Coming up soon--Mesoanalysis!

Quick map from our pre-2013 data

1 min read

I made a quick map of publication data that we gathered pre-2013. Points are sized based on how many texts we've cataloged from that location. Play around with it!

It still needs some cleaning--Petersburg should be in Virginia, not Russia, for example. But it does give a rough sense of where our collection of 18th-century novels was published.

I used Excel to isolate the information I wanted from our data, OpenRefine and Notepad++ to clean it and get the locations, and CartoDB to map it.

Recapping R Learning - June 19

1 min read

We missed Katie Rawson today at our workshop! But we managed to finish chapter 3 and get pretty far through chapter 4 of our textbook.

The theme of today was display issues. Some of our code began changing color in unpredictable ways, and our charts came out at various sizes.

We came to the consensus that it's often difficult to follow exactly what the different functions mean as we go through the text. We also want to know more about spaces--sometimes they seem irrelevant, and sometimes they gum up the whole system.

At the beginning, as per usual, we spent some time getting things up and running again. I actually think the fact that we come back to this once a week is probably giving us a slight cognitive advantage, since we're being forced to recall what we did last time. My sense was that it's going just a bit more smoothly for people, and we're getting better at hunting down typos and helping each other.

What do we want to learn about for Theory Thursday?

1 min read

-More on the rise of the novel
-Background on the 18th-century novel, back to 17th century
-Social context of 18th century

-Samuel Johnson’s essay in the Rambler about the new species of fiction.
-Book club?


Skill Sets to Develop for Individual Projects

1 min read

We talked this morning about the skills the students want to develop for their projects. Here are a few things that came up multiple times.
-how to pull out the specific data fields you're interested in
-automatically comparing and graphing connections in data
-automatically finding place names in full-text files
-mapping tools (maybe Mitch can give us his CartoDB workshop?)


Visualizing Data

1 min read

Here's what we talked about today. There are a ton more!
Voyant: Word clouds, word tracking.
Lexos: Dendrograms, stylistic similarity.
Mallet: Topic modeling.
CartoDB: Mapping.
Gephi: Network analysis.
NodeXL: Twitter/social media networks.


ESTC tags that might signify novels

1 min read

We're trying to find all of the novels in the ESTC. Here are the possible tags, and what we decided. Feel free to offer up your opinions!

Fiction, Novel, & Imaginary - definite yes
Allegories - yes
Comic histories - yes
Fables & Parables - maybe, but they might be too short
Fairy tales - yes
Fantasy literature - yes
Folktales, Folklore, and Myths and Legends - maybe
Novellas - yes
Nursery stories - no
Romances - yes
Sea stories - yes
Utopian literature - yes