Data Set Troubles

Hi all,

So it seems like the only way one can get old data from Twitter is to pay for it. There is a site called Topsy that seemed promising at first, because they do let you get data from several years back. However, drawing any reliable conclusions from that data would be hard, because they only give you ten pages of results, and you can only specify a date range, not a time range. For the periods of time I am interested in, there was so much tweeting going on that ten pages of results only cover about two hours (and there is no controlling which two hours they show you). Plus, I think they are showing “top tweets,” rather than a feed of tweets in the order they happened, so that is another factor limiting their usefulness. Not to mention that Topsy doesn’t offer location data. The people at Topsy support sent me a list of other vendors, including Gnip and Datasift, which both cost money. I also looked at Keyhole, which looks great, but again, it costs money to get “historical data” from them.

Unless someone has any ideas about how to get historical data off Twitter without having to pay for it, I think I am going to shift my focus temporarily to working on the tweets surrounding the election, which should be easier, because it just happened. I will need to learn how to use the Twitter API to do that, though, so if anyone knows how, I would much appreciate any pointers you could give me. Also, I will need to figure out what my focus would be here–particularly in light of Wendy Davis’s recent loss of the gubernatorial race. Maybe I could compare this year to past years? Perhaps there was more support for Wendy Davis than for other Democratic candidates?

I am still attached to my earlier idea about mapping the locational data, though, and thinking about transforming this into a final project proposal. I think there are probably some organizations in Texas that might consider funding something like this, though I am also interested in broadening the project out to have a wider appeal. Here are some ideas I had for expanding it. (I would welcome other ideas or suggestions):

  • Doing something like HyperCities, where video and other media could be layered over the map of tweets
  • Charting the tweets of conservatives as well as liberals
  • Charting the change in tone of the tweets over multiple important moments such as the hearings and Wendy Davis’s filibuster
  • Doing a data visualization of the other hashtags that were used in relation to the ones I already know about.
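
As a rough starting point for the last idea, here is a minimal sketch of how one might count the other hashtags that show up alongside a known one. It assumes the tweets are already available as plain text strings; the sample tweets and the function name co_occurring_hashtags are made up for illustration and are not from any real data set.

```python
import re
from collections import Counter

# Matches a '#' followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#(\w+)")

def co_occurring_hashtags(tweets, known_tags):
    """Count hashtags that appear in the same tweet as any known tag."""
    known = {tag.lower() for tag in known_tags}
    counts = Counter()
    for text in tweets:
        # Lowercase so that #SB5 and #sb5 are treated as the same tag.
        tags = {t.lower() for t in HASHTAG.findall(text)}
        if tags & known:
            # Only count the tags we did not already know about.
            counts.update(tags - known)
    return counts

# Hypothetical sample tweets, invented for this example.
sample = [
    "Watching the filibuster #StandWithWendy #txlege",
    "#standwithwendy #SB5 is not over",
    "Lunch in Austin #tacos",
    "#txlege hearing tonight #SB5",
]

result = co_occurring_hashtags(sample, ["StandWithWendy"])
print(result.most_common())
```

The resulting counts could then be fed into whatever visualization tool ends up being used; tweets that never mention a known tag are ignored entirely.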

9 thoughts on “Data Set Troubles”

  1. Kelly Marie Blanchat


    Did you consider using the TAGS explorer for hashtags related to Wendy Davis? I’m not to the level of being able to use the Twitter API to gather data, but I was able to create a google archive and transfer it into a visualization using the instructions Matt posted. It might not get you the location data you want, but it’s a start, perhaps.
    Otherwise, I feel like you and I are in the same boat with the data visualization project. There’s so much I want to do, but I can’t quite get there. I’m also unable to get to the data visualization workshops (and last night my dataset imported into Gephi broke my computer — yikes).
    There’s a portion of the final assignment that asks “Do you need to know code in order to do DH?” I’m feeling the answer is more and more, “yes”. I know some, but I’m getting stuck.


    PS: Sorry about the Texas election. Major bummer. Twitter was encouraging, however.

  2. James Mason

    TAGS Explorer is okay. Depending on the size of your dataset, you could quickly find yourself running out of space. By default, TAGS moves the tweets to a Google Spreadsheet, which caps at 400,000 cells. With the multiple columns that TAGS takes up to begin with, I was able to get about 4000 tweets out of that. You can set it up so that it switches to another sheet after the first is filled, but unless you can work with the Twitter API, you’ll have to do this manually. Even if you set it to send your data to Excel, you’d still be met with their 1mil cell limit.

    As for getting historical data, well… the one thing I learned from my dataset project is that it gets VERY expensive to get this sort of (big) data. When I was doing my project I used two programs, Keyhole and Hashtracking. To retrieve historical data from Hashtracking, you pay $2 per 1,000 tweets. As far as Keyhole goes… well… they didn’t even get back to me when I requested a quote for the 800k tweet archive I was trying to get.

    Even if you were to learn the Twitter API, I am pretty sure you can AT BEST retrieve tweets up to one week old (I may be wrong on this). It’s absolutely crazy when you consider that all of these services, like Hashtracking, Keyhole, and surely Topsy, are archiving tweets and have just tons and tons of data stored and ready for sale to the businesses that need it. Until academia is able to broker some sort of deal to get this stuff for the purposes of research, the data is far too valuable to businesses (and, in this case, politicians) to be free.

  3. Sarah Cohn

    Related to what we talked about earlier: making a basic Twitter archive using TAGS

    Martin Hawksey’s TAGS Google spreadsheet system to create your own archive

    Once you’ve set up the archive, it will automatically pull in new tweets (and a bunch of metadata) that use the tag in question. You can then visualize the archive using Hawksey’s TAGS Explorer –

  4. Elissa Myers Post author

    Thanks so much to Kelly for the link to this data set and to Sarah for the Google Tags Explorer suggestion! The data set is exactly what I was looking for, and I think Google Tags will be useful for analyzing it! Now I just need to get the data out of the JSON format it is in and into Excel or txt. When I tried to put it in Excel, some of the info was getting cut off. Any suggestions?

  5. Elissa Myers Post author

    Just to clarify, the problem is that the JSON files I got do not always list the fields in the same order. So, if I transfer this stuff to Excel, location might be in column 10 for one tweet, and column 15 for another tweet. Unfortunately, this is so bad sometimes that you can’t even tell what piece of info is supposed to be in a given column, so reformatting it by hand would be tedious and nigh impossible.

    I found something called SEO Tools for Excel, which seems like it would theoretically be useful, because it is designed to work with Twitter data in JSON format, but most of the functions I have seen pull data off Twitter, rather than reformatting what is already there. I now have a great data set, but I would really like it to be consistently formatted in Excel, so when I start visualizing and analyzing, my results are not skewed. Any suggestions would be most, most welcome!
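
    One possible way around the column-order problem, sketched under some assumptions: read each tweet as a JSON object, look fields up by name rather than by position, and write a CSV with one fixed set of columns that Excel can open. The sketch below assumes the data is line-delimited JSON (one tweet per line); the sample tweets, field names, and the helper names flatten and tweets_to_csv are hypothetical and may not match the actual data set.

```python
import csv
import io
import json

# Hypothetical sample: fields arrive in different orders, and the third
# tweet has no "location" at all.
SAMPLE_LINES = [
    '{"id": 1, "text": "Stand with Wendy", "user": {"screen_name": "a", "location": "Austin, TX"}}',
    '{"user": {"location": "Dallas", "screen_name": "b"}, "text": "At the capitol", "id": 2}',
    '{"text": "No location here", "id": 3, "user": {"screen_name": "c"}}',
]

def flatten(obj, prefix=""):
    """Turn nested dicts into one flat dict with dotted keys (user.location)."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            # Lists and other values are kept as-is and written as text.
            flat[name] = value
    return flat

def tweets_to_csv(lines, out):
    """Write one CSV row per tweet, with the same columns for every row."""
    rows = [flatten(json.loads(line)) for line in lines if line.strip()]
    # Gather every field name seen in any tweet; sorting fixes the order.
    columns = sorted({key for row in rows for key in row})
    # restval="" leaves the cell empty when a tweet lacks a field.
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return columns

buffer = io.StringIO()
columns = tweets_to_csv(SAMPLE_LINES, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

    To produce a file instead of printing, the same function can be given a file opened with open("tweets.csv", "w", newline="", encoding="utf-8"), and the lines can come from iterating over the opened JSON file. Because every row is written by field name, location always lands in the same column, and tweets without a location simply get an empty cell.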

  6. Mary Catherine Kinniburgh


    I’m not much help for the Twitter-specific stuff, but it looks like you’re in good hands. I really look forward to seeing how your project progresses, and your posts have raised some interesting questions on the availability of data and the “business” of data preservation.

