Data Set: Topic Modeling DfR

Hello Praxisers, I’m writing today about a dataset I’ve found. I’ll be really interested to hear any thoughts on how best to proceed, or more general comments.

I queried JSTOR’s Data for Research (DfR) service at dfr.jstor.org for citations, keywords, bigrams, trigrams and quadgrams for the full run of PMLA. JSTOR provides this data on request for all archived content. To get the full run, I first had to request an extension of the standard limit of 1,000 docs per DfR request. I then submitted the query and received an email notification several hours later that the dataset was ready for download at the DfR site. Both the query and the download are managed through the “Dataset Requests” tab at the top right of the website. The download was a little over a gig; I unzipped it and began looking at the files one by one in R.

Here’s where I ran into my first problem. I basically have thousands of small documents: each file holds either the citation info for one issue or a list of 40 trigrams from a single issue. My next step is to figure out how to prepare these files so that I’m working with a single large dataset instead of thousands of small ones.
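One way that combining step might look in base R is sketched below. It is only a sketch under assumptions: it supposes the small files are CSVs that all share the same columns, and the helper name `combine_csv_files`, the column names `term` and `weight`, and the demo file names are all made up for illustration rather than taken from the actual DfR download.

```r
# Hypothetical helper (not part of any package): read every CSV in a folder
# and stack the rows into one data frame. Assumes all files share the same
# column names, which rbind() requires.
combine_csv_files <- function(dir, pattern = "\\.csv$") {
  files <- list.files(dir, pattern = pattern, full.names = TRUE,
                      ignore.case = TRUE)
  if (length(files) == 0) stop("no matching files in ", dir)
  tables <- lapply(files, function(f) {
    d <- read.csv(f, stringsAsFactors = FALSE)
    # Record which file each row came from; rep() keeps this safe
    # even when a file has a header but no rows.
    d$source_file <- rep(basename(f), nrow(d))
    d
  })
  do.call(rbind, tables)
}

# Small demonstration with made-up data written to a temporary folder.
demo_dir <- tempfile("dfr_demo")
dir.create(demo_dir)
write.csv(data.frame(term = c("of the poem", "in the play"), weight = c(3, 2)),
          file.path(demo_dir, "doc_a.csv"), row.names = FALSE)
write.csv(data.frame(term = "the history of", weight = 5),
          file.path(demo_dir, "doc_b.csv"), row.names = FALSE)

combined <- combine_csv_files(demo_dir)
print(combined)
```

Keeping a `source_file` column means each row can still be traced back to the issue it came from after everything is merged. With thousands of files, `do.call(rbind, ...)` can be slow, so a faster binding function from a package may be worth a look once the basic approach works.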

I googled “DfR R analysis” and found a scholar, Andrew Goldstone, who has been working on analyzing the history of literary studies with DfR sets. His GitHub account contains a lot of the code and methodology for this analysis, including a description of his use of MALLET topic modeling through an R package. Not only is the methodology available, but so is the resulting artifact, a forthcoming article in New Literary History. My strategy now is simply to try to replicate some of his processes with my own dataset.

 

3 thoughts on “Data Set: Topic Modeling DfR”

  1. Micki

    Really great idea – most of all, I really enjoyed how engaging this post is for the reader. You have MALLET, R, GitHub and the JSTOR domain issues at play, and of course then what to do with the data! I will suggest that while the goal is to replicate and understand the prior research using this data and method, you may find your results impelling you in different directions than Goldstone’s, so be ready to explore uncharted territory!

  2. Elissa Myers

    Sounds really interesting! I am thinking about doing something (very undefined at this point) with the 19th Century UK periodical databases for my final project, so I look forward to reading more about your project method, which could end up giving me some useful methodological ideas!

  3. Liam Sweeney

    Thanks Micki and Elissa. I’m afraid I’ll have to find some time to dive back into an R MOOC before I can set anything up. But at the moment I’m playing with these visualizations: http://agoldst.github.io/dfr-browser/demo/#/model/scaled

    This is actually more or less the PMLA data I pulled. It’s sort of amazing to see the possibilities once you’re familiarized with the tools. Elissa, exciting to learn that you’re interested in the 19th Century British Pamphlets. I seem to remember hearing that there are some distinctions the service makes between archives and primary sources… I’ll double check that. For now, I can report that I’ve installed the devtools package with install.packages("devtools") and then installed a package Goldstone made for DfR topic modeling:
    library(devtools)
    install_github("agoldst/dfrtopics")
    I’m basically following instructions here: https://github.com/agoldst/dfrtopics
    If this is helpful I can just keep posting progress on this thread, and if you have any breakthroughs please share!
