topic modeling: introductory resources

A great tutorial for getting MALLET installed and running is Shawn Graham, Scott Weingart, and Ian Milligan's Getting Started with Topic Modeling and MALLET. If you aren't familiar with the command line yet, I recommend first working through the "using the command line" tutorial they link to at the beginning of the post. Take your time, go step by step, ask for help if you need it - I think you will find it easy and fun.

To use MALLET effectively, you'll need to know a bit about both the theory and the practice of topic modeling. Below are some introductory resources that people tend to find useful; there are many others out there that may better suit your particular learning style. The key is to read a few of them, then begin messing around with MALLET, and then keep reading to deepen your understanding as you get some hands-on experience.

A good place to start is Topic Modeling and Digital Humanities by David Blei, in the special topic modeling issue of the Journal of Digital Humanities; it offers an accessible introduction. (Blei's article Probabilistic Topic Modeling goes into more detail and is also really useful. You need not understand it all to get something out of it.)

In the same issue of JDH, I also recommend reading Megan Brett’s Topic Modeling: A Basic Introduction and then skimming the “applications” essays by Lisa Rhody and others.

After doing this to orient yourself, you may wish to read Ted Underwood’s more technical blog post Topic modeling made just simple enough for a more detailed perspective. (Note that this post, like the JDH articles, is from 2012 and some information may be out of date.)

David Mimno's video "The Details: Training and Validating Big Models on Big Data" is very useful to view (maybe a few times) at some point; even if you don't understand it all, you will get something out of it.

You may also want to take a look at the chapter on topic modeling from Matt Jockers's book on text mining, Macroanalysis; see our "readings" Dropbox for a PDF. I will also add some more examples of good recent articles using topic modeling methods for you to browse through.

There are also a fair number of critique-of-topic-modeling articles and blog posts out there, most of them warning you against too quickly interpreting the MALLET output as semantically meaningful. I'm not worried that you will do this. I will say that I've found it useful to think of this kind of topic modeling almost as a kind of fiction; I like to think of topics as a counterfactual, almost fictional set of materials out of which your documents might have been created - but clearly weren't. This helps me stay away from thinking about "topics" as any kind of simplistic map of the "contents" of a set of documents.
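If it helps to see that fiction spelled out, here is a minimal Python sketch of the "generative story" behind this kind of topic model. Everything in it is invented for illustration: the two toy topics, their words and weights, and the function name generate_document are my assumptions, not MALLET output or MALLET's API. MALLET works in the opposite direction, starting from your real documents and inferring topics that could have produced them in this way.

```python
import random

# Hypothetical toy "topics": each one is a probability distribution over words.
# These words and weights are made up for illustration, not MALLET output.
TOPICS = {
    "sea": {"ship": 0.5, "whale": 0.3, "storm": 0.2},
    "home": {"letter": 0.4, "kitchen": 0.4, "garden": 0.2},
}


def generate_document(topic_weights, length, rng):
    """Write a fictional document: for each word slot, pick a topic
    according to the document's topic mixture, then pick a word from
    that topic's word distribution."""
    topic_names = list(topic_weights)
    words = []
    for _ in range(length):
        topic = rng.choices(
            topic_names, weights=[topic_weights[name] for name in topic_names]
        )[0]
        vocab = TOPICS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return words


# A document that is mostly "sea" with a little "home" mixed in.
rng = random.Random(0)
doc = generate_document({"sea": 0.8, "home": 0.2}, 10, rng)
print(" ".join(doc))
```

No real document is written by rolling dice like this, which is the point of calling it a fiction. Changing the seed passed to random.Random produces a different "document" from the same topics, so even this toy version shows randomness entering at more than one step.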

The other thing I like to remember is that topic modeling is probabilistic in a number of senses - keep this in mind as you learn about it.