Author Archives: Tessa Maffucci

Data Mining Project, Tessa & Min, Part II

To follow up on Min’s post about retrieving data from social media platforms using Apigee, I wanted to report back on the next step in our process: preparing the data for visualization.

We decided to dig in with Instagram and see what we could do with the data returned for images tagged with the word sprezzatura. Using the method described in Min’s post, we were able to export the data by converting JSON to CSV. However, Apigee returns paginated results for Instagram, so each data dump covered only 20 images, while the total number of images with this tag is over 60,000. By copying and pasting the “next_url” into the Request URL field on Apigee’s console, we were able to move sequentially through the sets of data:

next url

We decided to repeat this process only ten times, since the 3,000 or so requests it would have taken to capture the whole data set (60,000 images at 20 per page) seemed excessive…
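For anyone who would rather script the copy-and-paste loop, here is a minimal Python sketch of the same idea. It assumes each response is JSON shaped like {"data": [...], "pagination": {"next_url": "..."}}; that layout, and the function names `fetch_json` and `collect_pages`, are our assumptions for illustration, not something taken from Apigee's console.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Download one page of results and parse it as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def collect_pages(first_url, fetch=fetch_json, max_pages=10):
    """Follow next_url links and gather the items from up to max_pages pages."""
    items = []
    url = first_url
    pages_read = 0
    while url and pages_read < max_pages:
        page = fetch(url)
        items.extend(page.get("data", []))
        # Assumed layout: the link to the next page sits under "pagination".
        # When it is missing, we have reached the last page and the loop ends.
        url = page.get("pagination", {}).get("next_url")
        pages_read += 1
    return items
```

Calling `collect_pages(first_url, max_pages=10)` with the first Request URL would mirror the ten rounds we did by hand. The `fetch` argument is there so the loop can be tried against canned pages before pointing it at a live endpoint.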

When we opened the CSV files in Excel we encountered another problem. The column layout of each file depends on how many tags, comments, likes, etc. its images have, so compiling the individual data dumps into one useful Excel file was tricky.

We compiled just the header information to try to make sense of it:

Screen Shot 2014-11-18 at 1.10.01 AM

The 5th row indicates a data dump that contained a video. As a result, additional columns were added to account for the video data. At first we thought that cleaning up the data from the 10 data dumps would just be a matter of adjusting for extra columns and moving the matching columns into alignment, but as we dug deeper into our data we realized that this wouldn’t work:

Screen Shot 2014-11-18 at 1.05.30 AM

As you can see, some of the data dumps go from reporting tags to a column for “type” followed by location data, while others go directly into reporting comment data. The same data is available in each data dump, but inexplicably it is not always returned in the same order.
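For what it's worth, this kind of mismatch can be handled in a script by matching columns on their header names rather than their positions. The following is a minimal Python sketch using only the standard library. The function name `merge_by_header` is ours, and the approach assumes the JSON-to-CSV converter names the same field the same way in every file, which we have not verified for every column.

```python
import csv


def merge_by_header(in_paths, out_path):
    """Combine CSV files whose columns differ in order or in number."""
    fieldnames = []  # union of all headers, in order of first appearance
    rows = []
    for path in in_paths:
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            # Each row is a dict keyed by header name, so column order
            # in the source file no longer matters.
            rows.extend(reader)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        # restval fills columns a file lacked (e.g. video fields) with blanks;
        # extrasaction="ignore" skips stray cells that have no header.
        writer = csv.DictWriter(
            f, fieldnames=fieldnames, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)
    return fieldnames
```

The output file has one column per distinct header, so a dump that contained a video simply contributes extra columns that stay blank for the other dumps.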

We looked into a few options for merging Excel spreadsheets based on column headers, but either the programs weren’t Mac-friendly or the merge seemed to do more harm than good. We decided to move ahead with cleaning up the data in a more or less manual way, with good old-fashioned copying and pasting. We wanted to look at location data on these images (perhaps #sprezzatura is still most commonly used in Italy, or maybe it’s been specifically appropriated by the fashion community in NYC?), so we decided to harvest the data for latitude and longitude. We did this by filtering the latitude column in each data dump to return only the images that had this data (only about a third of the images had geotagging turned on). We also gathered the username, the link to the image, and the time the image was posted.
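The filtering step described above can also be sketched in a few lines of Python. The column names below ("location/latitude", "user/username", and so on) are hypothetical stand-ins for whatever headers the JSON-to-CSV conversion actually produces, so treat them as assumptions to adjust rather than the real headers from our files.

```python
def geotagged_rows(
    rows,
    lat_col="location/latitude",   # hypothetical header name
    lng_col="location/longitude",  # hypothetical header name
    keep=("user/username", "link", "created_time"),  # hypothetical names
):
    """Keep only rows with both coordinates, plus a few columns of interest."""
    selected = []
    for row in rows:
        lat = (row.get(lat_col) or "").strip()
        lng = (row.get(lng_col) or "").strip()
        if not lat or not lng:
            continue  # geotagging was off for this image
        record = {name: row.get(name, "") for name in keep}
        record["latitude"] = float(lat)
        record["longitude"] = float(lng)
        selected.append(record)
    return selected
```

The `rows` argument can be any sequence of dictionaries, for example the rows read from a CSV file with `csv.DictReader`, and the result is a flat list that a mapping tool can import once it is written back out.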

We made a quick map in Tableau, just to see our data realized:

Screen Shot 2014-11-18 at 1.27.34 AM

Next steps are to make a more meaningful visualization around this topic. We’d be interested to try ImagePlot to analyze the images themselves, but we haven’t explored a batch download of Instagram photos yet.

Mapping Data: Workshop 3/3

Hi all,

Just to follow up on Mary Catherine’s post about finding data, I wanted to recap the final session of this workshop series that took place tonight.

The library guide on mapping data (by Margaret Smith) can be found here:

As in the other two workshops, Smith emphasized thinking about who keeps this data, and why, as part of the critical research process. This is especially relevant given the size of these data sets and maps: the person (or corporate entity, NGO, or government agency) hosting them likely has a very specific reason for doing so.

She took us through a few examples, from basic mapping sites like the NYT’s “Mapping America,” which draws on 2005-2009 Census Bureau data, to basic mapping applications like Social Explorer (the free edition has limited access, but the GC has bought full access) and the USGS and NASA mapping applications. The guide also includes a few more advanced mapping options, like ArcGIS, but the tool that seemed most useful to me, in the short term anyway, is Google’s Fusion Tables, which allows you to merge data sets that have terms in common. Smith’s example merged a set of demographic data (percentage of minority students) organized by town name (towns in Connecticut) with a second data set that defined geographic boundaries for the same towns. Fusion Tables then lets you map the demographic data and choose among various ways to visualize and customize your results.
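To make the idea of merging on a shared term concrete, here is a small Python sketch of a join on town name. It is not Fusion Tables' own interface, just an illustration of the underlying operation. The function name `join_on_key`, the column names, and the sample values are all made up for the example.

```python
def join_on_key(left_rows, right_rows, key):
    """Attach to each left row the fields of the right row sharing its key."""

    def normalize(value):
        # Ignore case and stray spaces so "Hartford " matches "hartford".
        return value.strip().lower()

    lookup = {normalize(row[key]): row for row in right_rows}
    merged, unmatched = [], []
    for row in left_rows:
        match = lookup.get(normalize(row[key]))
        if match is None:
            unmatched.append(row[key])  # worth checking by hand
        else:
            merged.append({**match, **row})
    return merged, unmatched


# Invented sample data, for illustration only.
demographics = [
    {"town": "Hartford", "pct_minority_students": "55"},
    {"town": "Old Town", "pct_minority_students": "10"},
]
boundaries = [
    {"town": "hartford", "boundary": "<polygon 1>"},
    {"town": "New Haven", "boundary": "<polygon 2>"},
]
merged, unmatched = join_on_key(demographics, boundaries, "town")
```

Returning the unmatched names alongside the merged rows is a deliberate choice: spelling differences between two data sets are a common reason a town silently drops off a map.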

My main takeaway from this series was that each of these tools is highly particular, and you have to really dig in and play with the individual system before you’ll even know if it is the right tool for your work.

That, and also learn R.