Merry White Christmas~~

Dear Digitalists,

I have to say, this course is absolutely one of the most fascinating courses I have ever taken (and I’m finishing my PhD, so I’ve probably taken the greatest number of courses here!). And I feel lucky to have met you all—you were such an inspiring group! Also, a big round of applause to our two amazing professors—thank you for masterminding this seminar (a year ago, I believe?); your pedagogical conceptions and curriculum designs are truly visionary.

Christmas is only a few days away, and I thought I would post something fun and Christmasy that is also related to my final project “Production of Desire, Consumption of Pleasure, and Creation of National Identity: Broadway Musicals as Cartography of US Sociocultural Values, 1920s-2010s.” In that spirit, why not run a data analysis of White Christmas, the Broadway musical adapted from a movie musical by Irving Berlin? It is by no means my favorite musical; in fact, it is a pretty cheesy, saccharine piece (with its own adorable moments). But so what? Christmas is all about eating candy and having some damn feel-good fun! So here we go:


What I’d like to see is which words stand out as topics/key words in this musical. Having been told that Mallet is best at handling topic modelling, I spent one afternoon teaching myself how to use Mallet.

I start by installing both Mallet and the Java Development Kit. Then I pull the data (the lyrics of all 18 songs in White Christmas) into one folder under the Mallet directory, so it’s ready to be imported. I run Mallet from the command line, typing commands such as “bin\mallet import-dir --help” to test it. Then I import the data and have Mallet create a file called “tutorial.mallet.”
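For reference, the import step looks roughly like this. The folder name white_christmas is my own choice for where the lyric files live, and the --keep-sequence and --remove-stopwords options follow the standard Mallet tutorial; run it from inside the Mallet directory in the Windows Command Prompt:

```
:: Run from the Mallet installation directory (e.g. C:\mallet).
:: "white_christmas" is the folder holding the 18 lyric .txt files.
bin\mallet import-dir ^
    --input white_christmas ^
    --output tutorial.mallet ^
    --keep-sequence ^
    --remove-stopwords
```

The --remove-stopwords option matters for lyrics: without it, words like “the” and “you” would dominate every topic.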


Then Mallet does its job and picks out the key words:



Next, I open this file, and with the command “bin\mallet train-topics --input tutorial.mallet --num-topics 20 --output-state topic-state.gz --output-topic-keys tutorial_keys.txt --output-doc-topics tutorial_composition.txt” I ask Mallet to find 20 topics. It generates three files:
1. topic-state.gz
2. tutorial_composition.txt
3. tutorial_keys.txt


The first one is a compressed file that lists every word in my input corpus along with the topic it is assigned to. Here is what it looks like after extraction:


The second one is the breakdown, by percentage, of each topic within each original text file.
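That composition file is easy to scan with a few lines of shell. The two rows below are invented stand-ins (real file names and proportions will differ), and in the Mallet version I used, each row lists the document and then topic/proportion pairs sorted from largest to smallest:

```shell
# Fake two-row stand-in for tutorial_composition.txt (tab-separated):
# doc number, source file, then topic/proportion pairs, best topic first.
printf '0\tfile:/C:/mallet/lyrics/white_christmas.txt\t7\t0.612\t2\t0.201\n' >  sample_composition.txt
printf '1\tfile:/C:/mallet/lyrics/sisters.txt\t3\t0.544\t7\t0.330\n'         >> sample_composition.txt

# Columns 3 and 4 hold each song's top-ranked topic and its share.
awk -F'\t' '{ printf "%s -> topic %s (%.1f%%)\n", $2, $3, $4 * 100 }' sample_composition.txt
```

This prints one line per song, e.g. `file:/C:/mallet/lyrics/white_christmas.txt -> topic 7 (61.2%)`, which is a quick way to see which topic dominates which number.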


The third file shows what the top key words are for each topic.


I clean up the data, and the result looks like this:



Now, since Mallet is known for generating a slightly different result each time, I have to run it at least twice. In my second try, I add the “optimize-interval” option, which tends to produce better results.
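Concretely, the second run just adds one option to the training command (the interval value 20 is the one suggested in the Mallet tutorials; other values work too). Turning on hyperparameter optimization lets Mallet give different topics different overall weights instead of treating them all as equally likely:

```
bin\mallet train-topics ^
    --input tutorial.mallet ^
    --num-topics 20 ^
    --optimize-interval 20 ^
    --output-state topic-state.gz ^
    --output-topic-keys tutorial_keys.txt ^
    --output-doc-topics tutorial_composition.txt
```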


What this option does is add a number indicating the weight of each topic! (Under item 8, “0.04778” is the “weight” of the topic “white,” followed by key words such as “bells,” “card,” “snow,” and “sleigh.”)
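Since each line of tutorial_keys.txt is just the topic number, its weight, and the top key words, tab-separated, it is easy to rank the topics by weight. The two sample lines below are made up to mirror that shape (the first one echoes the “white” topic described above):

```shell
# Two made-up lines in the shape of tutorial_keys.txt:
# topic number <TAB> weight <TAB> space-separated key words.
printf '8\t0.04778\twhite bells card snow sleigh\n' >  sample_keys.txt
printf '3\t0.11520\tlove blue skies happy day\n'    >> sample_keys.txt

# Rank topics by weight, heaviest first.
sort -k2,2 -rn sample_keys.txt
```

On the sample data this puts topic 3 first, since its weight (0.11520) is larger; on the real file it surfaces the musical’s most prominent topics at a glance.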


This topic-modelling process sounds really simple, but it actually takes quite some time to become familiar with. This was a trial run on a single musical; on a larger corpus of musicals, Mallet’s power should be more evident.

As for the musical data analysis in my project, I’m thinking of combining Schenkerian analysis with automated, idiomatic analysis of chord progressions. This is a musicological approach rather than audio signal processing. However, I’m not shutting down the latter option, since it might turn out to be more comprehensible to the general public, our eventual target audience. Also, a shout-out to the musicians in the group (I know there are several): come talk with me!

Merry Christmas everyone! (Looking at these key words makes my mouth covet sweetness; now where is my Twix?! …. nom nom…)


Data Set Troubles

Hi all,

So it seems like the only way to get old data from Twitter is to pay for it. There is a site called Topsy that seemed promising at first, because they do let you get data from several years back. However, making any reliable conclusions from that data would be hard, because they only give you ten pages of results, and you can only specify a date range, not a time range. For the periods of time I am interested in, there was so much tweeting going on that ten pages of results only covers about two hours (and there is no controlling which two hours they show you). Plus, I think they are showing “top tweets” rather than a feed of tweets in the order they happened, which further limits their usefulness. Not to mention that Topsy doesn’t offer locational data. The people at Topsy support sent me a list of other vendors, including Gnip and DataSift, which both cost money. I also looked at Keyhole, which looks great, but again, it costs money to get “historical data” from them.

Unless someone has any ideas about how to get historical data off Twitter without having to pay for it, I think I am going to shift my focus temporarily to working on tweeting surrounding the election, which should be easier, because it just happened. I will need to learn how to use the Twitter API to do that, though, so if anyone knows how, I would much appreciate any pointers you could give me. Also, I will need to figure out what my focus would be here, particularly in light of Wendy Davis’s recent loss in the gubernatorial race. Maybe I could compare this year to past years? Perhaps there was more support for Wendy Davis than for other Democratic candidates?
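For what it’s worth, once you register an app with Twitter and obtain credentials, a search for recent tweets is a single HTTP call. The sketch below makes several assumptions: the hashtag is an illustrative stand-in, TWITTER_BEARER_TOKEN is a placeholder for whatever credential you end up with, and the search endpoint only reaches back about a week, which is exactly why it suits an election that just happened:

```shell
# Sketch: application-only auth against Twitter's v1.1 search endpoint.
# "%23WendyDavis" is an illustrative hashtag (#WendyDavis, URL-encoded);
# $TWITTER_BEARER_TOKEN is a placeholder for a real credential.
curl -s \
  -H "Authorization: Bearer $TWITTER_BEARER_TOKEN" \
  "https://api.twitter.com/1.1/search/tweets.json?q=%23WendyDavis&count=100"
```

The response is JSON, with each tweet carrying its text, timestamp, and (when the user opted in) coordinates, which is the locational data I would need for the mapping idea below.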

I am still attached to my earlier idea about mapping the locational data, though, and thinking about transforming this into a final project proposal. I think there are probably some organizations in Texas that might consider funding something like this, though I am also interested in broadening the project out to have a wider appeal. Here are some ideas I had for expanding it. (I would welcome other ideas or suggestions):

  • Doing something like HyperCities, where video and other media could be layered over the map of tweets
  • Charting the tweets of conservatives as well as liberals
  • Charting the change in tone of the tweets over multiple important moments, such as the hearings and Wendy Davis’s filibuster
  • Doing a data visualization of the other hashtags that were used in relation to the ones I already know about