Tag Archives: R

Journaling Data, Chapter 2

Statistics and R

I am pursuing two unrelated paths. The first is a collaborative path with Joy, who has identified some interesting birth statistics. The file we started with was a PDF downloaded from the CDC (I believe). I used a website called zamzar.com to convert the PDF to a text file. The text file was a pretty big mess because it included a lot of text in addition to the tabular data that we are interested in.

Following techniques that Micki demonstrated in her Data Visualization Workshop, I used Text Wrangler to cut out a single table and gradually clean it up. I eliminated commas in numeric fields, removed extra spaces, and inserted line feeds until I had a pretty good tab-delimited text file. It imported very cleanly into Excel, where I did some additional cleaning and saved the table as a CSV file that would work well in R. The table reads into R very cleanly, so we can perform simple statistics on it such as median, min and max.
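The cleanup steps above can also be scripted. Here is a minimal sketch, written in Python rather than R or Text Wrangler, that assumes a small whitespace-separated table with no spaces inside cells. The sample rows, the column names and the helper functions (clean_line, to_rows, to_csv, summarize) are all made up for illustration and are not taken from the CDC file.

```python
import csv
import io
import re
import statistics

# Made-up rows standing in for one table cut out of the converted text file.
RAW_TABLE = """\
Year   Births   Rate
2010   1,200    10.5
2011   1,350    11.0
2012   980      9.5
"""

def clean_line(line):
    """Drop thousands separators, then turn runs of spaces/tabs into single tabs."""
    line = re.sub(r"(?<=\d),(?=\d{3}\b)", "", line)
    return re.sub(r"[ \t]+", "\t", line.strip())

def to_rows(raw):
    """Split the cleaned, tab-delimited lines into lists of cells."""
    return [clean_line(line).split("\t") for line in raw.splitlines() if line.strip()]

def to_csv(rows):
    """Serialize the rows as CSV text, ready to be saved and read into R."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

def summarize(rows, column):
    """Median, min and max of one numeric column (first row is the header)."""
    header, *body = rows
    index = header.index(column)
    values = [float(row[index]) for row in body]
    return {"median": statistics.median(values), "min": min(values), "max": max(values)}

rows = to_rows(RAW_TABLE)
print(to_csv(rows))
print(summarize(rows, "Births"))
```

A real table with spaces inside a cell (a state name, say) would need a smarter splitting rule than "any run of spaces", which is probably why the hand cleaning was gradual.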

Text Analysis

My other data path is working with text, specifically, Dickens’ “Great Expectations”. I have used no fewer than three different tools to open some windows onto the book. First I loaded a text file version of the book into Antconc, “…a freeware tool for carrying out corpus linguistics research and data-driven learning.” I was able to generate word counts and examine word clusters by frequency. The tool is very basic, so until I had a more specific target to search for, I set Antconc aside.

At Chris’s suggestion I turned to a website called Voyant-tools.org, which quickly creates a word cloud of your text/corpus. What it does nicely is provide the ability to apply a list of stop words, which eliminates many common frequently used words, such as ‘the’ and ‘to’. Using Voyant, I was able to very quickly create a word cloud and zero in on some interesting items.
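For readers curious about what a stop word list does to the counts, here is a rough sketch in Python. It is not how Voyant works internally. The tiny STOP_WORDS set and the tokenize and top_words helpers are illustrative assumptions; a real stop list would be much longer.

```python
import re
from collections import Counter

# A tiny illustrative stop list; real lists contain hundreds of words.
STOP_WORDS = {"the", "to", "a", "and", "of", "in", "i", "it", "was", "at"}

def tokenize(text):
    """Lowercase the text and pull out runs of letters (with simple apostrophes)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def top_words(text, stop_words=STOP_WORDS, n=10):
    """Count words that are not stop words and return the n most frequent."""
    counts = Counter(word for word in tokenize(text) if word not in stop_words)
    return counts.most_common(n)

sample = "Joe looked at the fire and Joe went to the forge."
print(top_words(sample, n=3))
```

The counts behind a word cloud are essentially this kind of table; the cloud just scales each word by its count.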

[Image: Screenshot 2014-11-09 10.19.59]

The most frequently mentioned character is Joe (in fact, Joe is the most frequent word) and not Pip or Miss Havisham. That discovery sent me back to Antconc to understand the contexts in which Joe appears. Other words that loom large in the word cloud and will require further investigation are ‘come’ and ‘went’ as a pair, ‘hand’ and ‘hands’, ‘little’, and ‘looked’/‘looking’.
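Looking at the contexts in which a word appears is usually called a concordance, or keyword-in-context view. Here is a minimal Python sketch of the idea. It is not Antconc's implementation; the concordance function and its width parameter are hypothetical names chosen for this example.

```python
import re

def concordance(text, keyword, width=30):
    """Return (left, match, right) context triples for each whole-word hit."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    flat = " ".join(text.split())  # collapse line breaks so context reads cleanly
    hits = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        hits.append((left, match.group(0), right))
    return hits

passage = "Joe sat by the fire.\nI looked at Joe and Joe's hand."
for left, word, right in concordance(passage, "Joe", width=15):
    print(f"{left:>15} [{word}] {right}")
```

Lining up the hits this way makes it easy to scan what tends to sit to the left and right of a name.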

Lastly, I have run the text through the Mallet topic modeler, and while I don’t know what to make of it yet, the top ten topics proposed by Mallet make fascinating reading, don’t they?

  1. miss havisham found left day set bed making low love
  2. made wemmick head great night life light part day dark
  3. mr pip jaggers pocket mrs young heard wopsle coming question
  4. boy knew herbert dear moment side air began hair father
  5. time long face home felt give manner half replied person
  6. back thought house make ll pumblechook herbert thing told days
  7. joe don mind place table door returned chair hope black
  8. hand put estella eyes asked stood gentleman sir heart london
  9. good round hands room fire gave times turned money case
  10. man looked biddy sister brought held provis sat aged child

At this point the exploration needs to be fueled by some more pointed questions that need answering. That is what will drive the research. Up until now it has been the tools that have been leading the way as I discover what they can do and what buttons to push to make them do it.











Journaling Data (what else?)

“Day 1”

Challenge number one is finding a dataset I can work with. I want to do something with text. There are plenty of books available digitally, but they always seem to be in formats, such as PDF, that require a fair amount of cleaning (which I would rather not do) before they can be consumed by any of the tools.

I did find a number of interesting datasets on the NYC Open Data website. Many of them are easy to download. I downloaded one small one that contains the average SAT scores for each high school. I was able to bring the dataset into R, too. It’s such a simple dataset (basically just school id plus the scores for each section of the SAT) that there isn’t much analysis one can do (mean, median, standard deviation…?). I would like to combine it with some other datasets that could enrich the data. For example, if I could get the locations of the schools, I could map the SAT data, or if I could get demographic data for each school, I could correlate SAT performance with various demographic variables.
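Combining two such files comes down to joining them on the school id. The sketch below shows the idea in Python rather than R. The column names (school_id, reading, math, writing, borough), the ids and every value are invented for illustration and are not from the NYC Open Data files.

```python
import csv
import io
import statistics

# Invented sample data; real files would be read from disk instead.
SCORES_CSV = """school_id,reading,math,writing
A01,400,420,390
A02,450,470,440
A03,380,400,370
"""
LOCATIONS_CSV = """school_id,borough
A01,Brooklyn
A02,Queens
"""

def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

def join_on(left, right, key):
    """Inner join: keep left rows whose key also appears in right."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]

def describe(rows, column):
    """Mean, median and sample standard deviation of one numeric column."""
    values = [float(row[column]) for row in rows]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

scores = read_rows(SCORES_CSV)
print(describe(scores, "math"))
print(join_on(scores, read_rows(LOCATIONS_CSV), "school_id"))
```

Note that the inner join silently drops schools with no match in the second file (A03 here), so it is worth comparing row counts before and after.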

On a parallel track, I am working with Joy, who is very interested in motherhood and wants to explore datasets related to that subject. If she can find a compelling dataset, we will work together to analyze it.


“Day 2”

So it turns out that Project Gutenberg has books in TXT format that you can download. It also appears that there is a website, www.zamzar.com, that will convert PDF files to text files for free. Question for further thought: if the PDF is a scanned image, will converting it to text get me anywhere? I doubt it. The best way to find out is to give it a go.

I am going to download Great Expectations from Project Gutenberg and see if I can pull it into R and, perhaps, Gephi or Antconc to see what can be done that might be interesting.

“Day 3”

Stay tuned…

Data Set: Topic Modeling DfR

Hello Praxisers, I’m writing today about a dataset I’ve found. I’ll be really interested to hear any thoughts on how best to proceed, or more general comments.

I queried JSTOR’s dfr.jstor.org Data for Research for citations, keywords, bigrams, trigrams and quadgrams for the full run of PMLA. JSTOR gives this data upon request for all archived content. To do this I had to request an extension of the standard 1000 docs you can request from DfR. I then submitted the query and received an email notification several hours later that the dataset was ready for download at the DfR site. Both the query and the download are managed through the “Dataset Requests” tab at the top right of the website. It was a little over a gig, and I unzipped it and began looking at the files one by one in R.

Here’s where I ran into my first problem. I basically have thousands of small documents, with citation info for one issue per file, or a list of 40 trigrams from a single issue. My next step is to figure out how to prepare these files so that I’m working with a single large dataset instead of thousands of small ones.
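One possible way to approach that step, sketched in Python rather than R: read every small file, tag each row with the file it came from, and stack everything into one table. This assumes each file is a CSV with a header row, which may not match the actual DfR layout. The function names and the source_file column are hypothetical.

```python
import csv
from pathlib import Path

def combine_csv_files(folder, pattern="*.csv"):
    """Read every matching file and return one list of rows, each tagged with its source file."""
    combined = []
    for path in sorted(Path(folder).glob(pattern)):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = path.stem  # remember which issue the row came from
                combined.append(row)
    return combined

def write_combined(rows, out_path):
    """Write the stacked rows to a single CSV, using the union of all column names."""
    fieldnames = sorted({key for row in rows for key in row})
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Keeping the source file name as a column means the single large table can still be grouped back by issue later.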

I googled “DfR R analysis” and found a scholar, Andrew Goldstone, who has been working on analyzing the history of literary studies with DfR sets. His GitHub contains a lot of the code and methodology for this analysis, including a description of his use of Mallet topic modeling through an R package. Not only is the methodology available, but so is the resulting artifact, a forthcoming article in New Literary History. My strategy now is simply to try to replicate some of his processes with my own dataset.