Tag Archives: data visualization

The Freedom to Move


José Palazon/Reuters
African asylum seekers stuck on a razor wire fence, behind white-clad golfers teeing off on a golf course.

I started this data set project a couple of weeks ago, but it has taken me a while to post and share. I think I was looking to show something “more”: more complete, more thought out, more visually appealing, more theoretically and methodologically sound. And then I realized that “more” probably won’t come if the project is sitting idly in Tableau, unseen by others. While I love the idea of collaborating, I’m still getting used to the idea of sharing incomplete things in the process. My lack of technological savvy also had me guarded for a while. Luckily, seeing everyone’s impressive progress and feedback to others has given me the courage to just share what I’ve done already.

In my income and inequality class, a great class filled with data, Professor Milanovic led me to a study on travel freedom, The Henley and Partners Visa Restrictions Index. I haven’t had the chance to do additional research on this for-profit organization, which I would conduct if I were to pursue this further. The survey received a substantial amount of media attention, and Wikipedia based its “Visa requirements for United States citizens” page on it. I mention the media and wiki attention not to verify the survey’s reliability but to show its influence. In Prof. Milanovic’s class, we were looking at migration as a means of addressing global inequality (the income inequality between countries), but many of the areas of the world that would benefit most economically from migration face heavy travel restrictions. To make this connection clearer, I matched each country’s travel freedom ranking with its 2013 GDP, provided by the World Bank. Several other organizations, such as the UN, IMF, and CIA, track GDP, but I decided to go with the World Bank’s numbers.

The visa restrictions index was in a PDF with multiple columns, which required some text cleaning after pasting into a text editor, Notepad++. I had to learn about regular expressions in order to do this efficiently. Pretty simple: I had to replace SPACE with TAB, but I also had to keep in mind multi-word nations like Central African Republic or the United States. It wasn’t a clean find-and-replace in the end, so I had to clean up some things manually. But overall, a great tip I picked up in Micki’s data visualization class.
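For anyone who wants to skip some of the manual touch-ups, the same idea can be sketched in a few lines of Python. The sample lines and the rank/name/score layout here are my own invention, since the actual PDF paste was messier:

```python
import re

# Hypothetical lines as pasted from the PDF: "<rank> <country name> <score>"
lines = [
    "1 United Kingdom 173",
    "41 Central African Republic 95",
]

cleaned = []
for line in lines:
    # Anchor to the digits at either end of the line, so the spaces
    # inside multi-word country names are never touched.
    cleaned.append(re.sub(r"^(\d+) (.+) (\d+)$", r"\1\t\2\t\3", line))

print(cleaned[1])  # tabs now separate rank, name, and score
```

The trick is that a greedy middle group plus the anchored rank and score means only the two outermost spaces become tabs, which is exactly what a naive SPACE-to-TAB replace gets wrong.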

I then saved the file as a CSV. I created another column to insert GDP data from the World Bank’s data repository, which is a great source of information. I downloaded a spreadsheet that included the GDP for each country for every year since 1980. I was interested in the latest, 2013 data. Inserting this information required some more manual work. I’m sure I could have used an Excel function, but after spending some time looking for that function, my impatience got the best of me and I decided to do it the not-so-quick and dirty way. I copied and pasted after I put the countries in alphabetical order. For the most part the naming conventions were the same, so it didn’t take very long. If I were to do this again, I would definitely figure out how to do this correctly, but I didn’t want to lose my momentum.
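The Excel function in question is essentially VLOOKUP; the same country-name join can also be sketched as a dictionary lookup in Python. The file snippets and figures below are illustrative, not the real data:

```python
import csv, io

# Toy stand-ins for the two files: visa rank data and World Bank GDP.
visa_csv = "country,rank\nGermany,1\nAfghanistan,94\n"
gdp_csv = "country,gdp_2013\nGermany,3730000000000\nAfghanistan,20460000000\n"

# Build a lookup table keyed on country name.
gdp = {row["country"]: row["gdp_2013"] for row in csv.DictReader(io.StringIO(gdp_csv))}

merged = []
for row in csv.DictReader(io.StringIO(visa_csv)):
    # Attach GDP by name; countries missing from the lookup stay blank.
    row["gdp_2013"] = gdp.get(row["country"], "")
    merged.append(row)

print(merged[0])
```

Even in code, mismatched naming conventions (e.g. “Korea, Rep.” vs. “South Korea”) would still need manual reconciliation, which is where a pure formula approach stumbles too.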

So my data looks like this:

After combining and cleaning the Henley and Partners Visa Restriction Index data and World Bank 2013 GDP data


I decided to use Tableau to visualize this data. I wanted to highlight the geographical aspect of this data, as we are talking about visa and travel freedom. I thought it would be interesting to see where the clusters of countries with the highest travel freedom are in comparison to countries with the lowest travel freedom. I didn’t know how to show GDP simultaneously besides showing up in the bubble when you hover over the countries. Here is a snapshot of the map. You can go here for the “interactive” version, where you will see the GDP information.

You will notice that the countries with high GDP have the fewest travel restrictions. The countries with the lowest GDP, the countries whose citizens would benefit the most from migration by taking available jobs and escaping political corruption in their home countries, have the least travel freedom. So one would come to the conclusion that the current system of immigration is counterproductive in addressing global inequality.

The field of economics, as one would expect, is extremely data-heavy. Our professor would leave his Stata code at the bottom of his slides in case we wanted to recreate his figures. As a non-economics major, the hard numbers/algorithms stuff made me a little nervous, but I was also excited to see these kinds of sociological patterns. Visualizing these patterns is important because of their potential to raise awareness and spur political activism. Letting them stay hidden in industry publications or esoteric economics conferences won’t do much good, but publicizing and presenting them in ways that grab people’s attention might. The media has been really flexing its data visualization muscles recently. That being said, I was really happy to hear about the GC’s potential course crossovers with the Journalism School. It might give me a chance to keep pursuing this data project.


Mapping the Icelandic Outlaw Sagas

Greetings, fellow digitalists,

*warning: long read*

I am so impressed with everyone’s projects! I feel like I blinked on this blog, and all of a sudden everything started happening! I’ve tried to go back and comment on all your work so far–let me know if I’ve missed anything. Once again, truly grateful for your inspiring work.

Now that it’s my turn: I’d like to share a project that I’ve been working on for the past year or so. I’ll break it down into two blog posts–one where I discuss the first part, and the other that requests your assistance for the part I’m still working on.

A year ago, I received funding from Columbia Libraries Digital Centers Internship Program to work in their Digital Social Sciences Center on a digital project of my own choosing. I’ve always gravitated towards the medieval North Atlantic, particularly with anything dark, brooding, and scattered with thorns and eths (these fabulous letters didn’t make it from Old English and Old Norse into our modern alphabet: Þ, þ are thorn, and Ð, ð are eth). Driven by my research interests in the spatiality of imaginative reading environments and their potential lived analogues, I set out to create a map of the Icelandic outlaw sagas that could account for their geospatial and narrative dimensions.

Since you all have been so wonderfully transparent in your documentation of your process, to do so discursively for a moment: the neat little sentence that ends the above paragraph has been almost a year in the making! The process of creating this digital project was messy, and it was a constant quest to revise, clarify, research, and streamline. You can read more about this process here and here and here, to see the gear shifts, epic flubs, and general messiness this project entails.

But, to keep with this theme of documentation as a means of controlling data’s chaotic properties, I’ve decided to thematically break down this blog post into elements of the project’s documentation. Since we’ve already had some excellent posts on Gephi and data visualization, I’ll only briefly cover that part of my project towards the end–look for more details on that part two in another blog post, like I mention above.

As a final, brief preface: some of these sections have been borrowed from my actual codebook that I submitted in completion of this project this past summer, and some parts are from an article draft I’m writing on this topic–but the bulk of what you’ll read below are my class-specific thoughts on data work and my process. You’ll see the section in header font, and the explanation below. Ready?

Introduction to the Dataset

The intention of this project was to collect data on place name in literature in order to visualize and analyze it from a geographic as well as literary perspective. I digitized and encoded three of the Icelandic Sagas, or Íslendingasögur, related to outlaws from the thirteenth and fourteenth centuries, titled Grettis saga (Grettir’s Saga), Gísla saga Súrssonar (Gisli’s Saga), and Hardar Saga og Hölmverja (The Saga of Hord and the People of Holm). I then collected geospatial data through internet sources (rather than fieldwork, although this would be a valuable future component) at the Data Service of the Digital Social Sciences Center of Columbia Libraries, during the timeframe of September 17th, 2013, to June 14th, 2014. Additionally, as part of my documentation of this data set, I had to record all of the hardware, software, and Javascript libraries I used–this, along with the mention of the date, will allow my research to be reproduced and verified.

Data Sources

Part of the reason I wanted to work with medieval texts is their open-source status; stories from the Íslendingasögur are not under copyright in their original Old Norse or in most 18th and 19th century translations and many are available online. However, since this project’s time span was only a year, I didn’t want to spend time laboriously translating Old Norse when the place names I was extracting from the sagas would be almost identical in translation. With this in mind, I used the most recent and definitive English translations of the sagas to encode place name mentions, and cross-referenced place names with the original Old Norse when searching for their geospatial data (The Complete Sagas of Icelanders, including 49 Tales. ed. Viðar Hreinsson. Reykjavík: Leifur Eiríksson Pub., 1997).


When I encountered this section of my documentation (not as a data scientist, but as a student of literature), it took me a while to consider what it meant. I’ll be using the concept of “data’s universe,” or the scope of the data sample, as the fulcrum for many of the theoretical questions that have accompanied this project, so prepare yourself to dive into some discipline-specific prose!

On the one hand, the universe of the data is the literary world of the Icelandic Sagas, a body of literature from the 13th and 14th centuries in medieval Iceland. Over the centuries, they have been transmuted from manuscript form, to transcription, to translation in print, and finally to digital documents—the latter of which has been used in this project as sole textual reference. Given the manifold nature of their material textual presence—and indeed, the manuscript variations and variety of textual editions of each saga—we cannot pinpoint the literary universe to a particular stage of materiality, since to privilege one form would exclude another. Seemingly, my data universe would be the imaginative and conceptual world of the sagas as seen in their status as literary works.

A manuscript image from Njáls saga in the Möðruvallabók (AM 132 folio 13r) circa 1350, via Wikipedia

However, this does not account for the geospatial element of this project, or the potential real connections between lived experience and the geographic spaces that the sagas depict. Shouldn’t the data universe accommodate Iceland’s geography, too? Treating literary spaces geographically, however, is a little fraught: in this project, I had to negotiate the idea that mapping the sagas is even possible from a literary perspective. In the latter half of the twentieth century, scholars considered the sagas primarily literary texts, full of suspicious monsters and other non-veracities, which from this perspective could not possibly be historical. The idea of mapping the sagas would thus have been irrelevant according to this logic, since seemingly factual inconsistencies undermined the historical, and thus geographic, possibilities of the sagas at every interpretive turn.

However, interdisciplinary efforts have increasingly challenged this dismissive assumption. Everything from studies on pollen that confirm the environmental difficulties described in the sagas, to computational studies suggesting that the social patterns represented in the Icelandic sagas are quantitatively similar to genuine social networks, suggests that the sagas are concerned with the environment and geography that surrounded their literary production.

But even if we can create a map of these sagas, how can we counter the critique that mapping promotes an artificial “flattening” of place, removing the complexity of ideas by stripping them down to geospatial points? Our course text, Hypercities, speaks to this challenge by proposing the creation of “deep maps” that account for the temporal, cultural, and emotional dimensions that inform the production of space. I wanted to preserve the idea of the “deep map” in my geospatial project on the Icelandic Sagas, so in a classic etymological DH move (shout out to our founding father Busa, as ever), I attempted to find out where the idea of “deep mapping” might predate Hypercities, which only came out this year yet represents a far earlier concept.

I traced the term “deep mapping” back to William Least Heat-Moon, who coined the phrase in the title of his book, PrairyErth (A Deep Map), to indicate the “detailed describing of place that can only occur in narrative” (Mendelson, Donna. 1999. “‘Transparent Overlay Maps’: Layers of Place Knowledge in Human Geography and Ecocriticism.” Interdisciplinary Literary Studies 1:1. p. 81). According to this definition, “deep maps” occur primarily in narrative, creating depictions of places whose density of experience can never truly be captured on a geographic grid. Heat-Moon’s use of the phrase, however, does not preclude earlier representations of the concept; the use of narratives that explore particular geographies is as old as the technology of writing. In fact, according to Heat-Moon’s conception of deep mapping, we might consider the medieval Icelandic sagas a deep map in their detailed portrayal of places, landscape, and the environment in post-settlement Iceland. Often centering on a few regional farmsteads, the sagas describe minute aspects of daily Icelandic life, including staking claim to beached whales under driftage rights, traveling to Althing (now Thingvellir) for annual law assemblies, and traversing valleys on horseback to seek out supernatural foes for battle. Adhering to a narrative form not seen again until the rise of the novel in the 18th century, the Íslendingasögur are a unique medieval exemplum of Heat-Moon’s concept of deep mapping and the geographic complexity that results from narrative. Thus, a “deep map” may include not only a narrative, such as the sagas’ plots, but potentially also a geographic map for the superimposition of knowledge upon it, allowing these layers of meaning to build and generate new meaning.

To tighten the data universe a little more: specifically within the sagas, I have chosen the outlaw-themed sagas for their shared thematic investment in place names and geography. Given that much of the center of Iceland today consists of glaciers and wasteland, outlaws had precious few options for survival once pushed to the margins of their society. Thus, geographic aspects of place name seem to be just as essential to the narrative of sagas as their more literary qualities—such as how they are used in sentences, or what place names are used to obscure or reveal.

Map of Iceland, by Isaac de La Peyrère, Amsterdam, 1715. via Cornell University Library, Icelandica Collection

In many ways, the question of “universe” for my data is the crux of my research question itself: how do we account for the different intersections of universes—both imaginative and literary, as well as geographic and historical—within our unit of analysis?

Unit of Analysis

If we dissect the element that allows geospatial and literary forms to interact, we arrive at place name. Place names are a natural place for this site of tension between literary and geographic place, since they exist in the one shared medium of these two modes of representation: language. In their linguistic as well as geographic connotations, place names function as the main site of connection between geographic and narrative depictions of space, and it is upon this argument that this project uses place name as its unit of analysis.


Alright, now that we’re out of the figurative woods, on to the data itself. Here are the steps I used to create a geospatial map with metadata for these saga place names.

Data Collection, Place Names:

The print text was scanned and digitized using ABBYY FineReader 11.0, which performs Optical Character Recognition to ensure PDFs are readable (or “optional character recognition,” as I like to say) and converted to an XML file. I then used the flexible coding format of the XML to hand-encode place name mentions using TEI protocol and a custom schema. In the XML file, names were cleaned from OCR and standardized according to Anglicized spellings to ensure searchability across the data, and for look-up in search engines such as Google–this saved a step in data clean-up once I’d transformed the XML into a CSV.



Here’s the TEI header of the XML document–note that it’s nested tags, just like HTML.

Data Extraction / Cleanup

In order to extract data, the XML document was saved as a CSV. Literally, “File > Save As.” This is a huge benefit of using flexible markup like XML, as opposed to annotation software such as NVivo that can be difficult to extract data from, which I wrote about here on Columbia University Library’s blog in a description of my methodology. In the original raw, uncleaned file, columns represented encoding variables, and rows represented the encoded text. I cleaned the data to eliminate column redundancies and extraneous blank spaces, as well as to preserve the following variables: place name, chapter and saga of place name, type of place name usage, and place name presence in poetry, prose, or speech. I re-checked my spelling here, too–next time, no hand-encoding!
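For readers curious what the XML-to-CSV step looks like as code rather than “File > Save As,” here is a sketch using Python’s standard library. The tag and attribute names are placeholders, since the real custom TEI schema isn’t reproduced here:

```python
import csv, io
import xml.etree.ElementTree as ET

# A toy TEI-style fragment; the actual schema is the author's custom one,
# so treat these element and attribute names as assumptions.
tei = """<div saga="Grettis saga">
  <p chapter="1">They sailed to <placeName type="farm">Bjarg</placeName>
  and on to <placeName type="assembly">Thingvellir</placeName>.</p>
</div>"""

root = ET.fromstring(tei)
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["placeName", "saga", "chapter", "type"])
# Walk each paragraph and pull out its encoded place names.
for p in root.iter("p"):
    for pn in p.iter("placeName"):
        writer.writerow([pn.text, root.get("saga"), p.get("chapter"), pn.get("type")])

print(out.getvalue())
```

Each encoded variable becomes a column and each encoded mention a row, which mirrors the raw export described above.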



Here’s the CSV file after I cleaned it up (it was a mess at first!)


I saved individual CSVs, but also kept related info in an Excel document. One sheet, featured here, was a key to all the variables of my columns, so anyone could decipher my data.

Resulting Metadata:

Once extracted, I geocoded place names using the open-source software Quantitative Geographic Information Systems (QGIS), which is remarkably similar to ArcGIS except FREE, and was able to accommodate some of those special Icelandic characters I discussed earlier. The resulting geospatial file is called a shapefile, and QGIS allows you to link the shapefile (containing your geospatial data) with a CSV (which contains your metadata). This feature allowed me to link my geocoded points to their corresponding metadata (the CSV file that I’d created earlier, which had place name, its respective saga, all that good stuff) with a unique ID number.
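The join QGIS performs here is easy to picture in code: two tables sharing a unique ID, zipped together. A toy Python sketch, with all field names and coordinates invented:

```python
# Geocoded points (what the shapefile holds) and metadata rows (the CSV)
# share a unique ID, just as in the QGIS join described above.
points = [
    {"id": 1, "lat": 65.08, "lon": -21.92},
    {"id": 2, "lat": 64.26, "lon": -21.13},
]
metadata = {
    1: {"placeName": "Bjarg", "saga": "Grettis saga"},
    2: {"placeName": "Thingvellir", "saga": "Grettis saga"},
}

# Merge each point with its metadata record by ID.
joined = [{**pt, **metadata[pt["id"]]} for pt in points]
print(joined[0]["placeName"])
```

The ID is doing all the work: as long as it is unique on both sides, the geospatial and literary variables travel together from then on.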

Data Visualization, or THE BIG REVEAL   

While QGIS is powerful and very accessible, it’s not the most user-friendly software. It takes a little time to learn, and I certainly did not expect that everyone who might want to see my data would also want to learn new software! To that end, I used the JavaScript library Leaflet to create an interactive map online. You can check it out here–notice there’s a sidebar that lets you filter information on what type of geographic feature the place name comprises, and pop-ups appear when you click on a place name so you can see how many times it occurs within the three outlaw sagas. Here’s one for country mentions, too.
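A Leaflet map like this is just a standalone HTML page plus a little JavaScript, so it can even be generated from Python with no extra libraries. A minimal sketch, with invented place names and mention counts standing in for the real data:

```python
import json

# Invented sample data: place names with coordinates and mention counts.
places = [
    {"name": "Thingvellir", "lat": 64.26, "lon": -21.13, "mentions": 12},
    {"name": "Bjarg", "lat": 65.08, "lon": -21.92, "mentions": 7},
]

# One Leaflet marker per place, with a popup showing the mention count.
markers = "\n".join(
    "L.marker([{lat}, {lon}]).addTo(map).bindPopup({popup});".format(
        lat=p["lat"], lon=p["lon"],
        popup=json.dumps("{name}: {mentions} mentions".format(**p)))
    for p in places
)

html = """<!DOCTYPE html><html><head>
<link rel="stylesheet" href="https://unpkg.com/leaflet/dist/leaflet.css"/>
<script src="https://unpkg.com/leaflet/dist/leaflet.js"></script>
</head><body><div id="map" style="height: 600px"></div>
<script>
var map = L.map('map').setView([64.9, -18.6], 6);
L.tileLayer('https://tile.openstreetmap.org/{z}/{x}/{y}.png').addTo(map);
MARKERS
</script></body></html>""".replace("MARKERS", markers)

with open("saga_map.html", "w") as f:
    f.write(html)
```

Opening saga_map.html in a browser shows the clickable markers; the sidebar filtering in the real map takes more JavaScript, but this is the core of it.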



Click on this image to get to the link and interact with the map.


As the process of this documentation highlights, I feel that working with data is most labor-intensive when it comes to positioning the argument you want your data to make. Of course, actually creating the data by encoding texts and geocoding takes a while too, but so much of the labor in data sets in the humanities is intellectual and theoretical. I think this is a huge strength in bringing humanities scholars towards digital methodologies: the techniques that we use to contextualize very complex systems like literature, fashion, history, motherhood, Taylor Swift (trying to get some class shout-outs in here!) all have a LOT to add to the treatment of data in this digital age.

Thank you for taking the time to read this–and please be sure to let me know if you have any questions, or if I can help you in mapping in any way!

In the meantime, stay tuned for another brief blog post, where I’ll solicit your help for the next stage of this project: visualizing the imaginative components of place name as a corollary to this geographic map.

Twitter and #Ferguson

Dear all,

In the aftermath of the Ferguson decision, and the much-discussed condemning of social media in McCulloch’s speech, we can see the high stakes of a lot of questions we’ve discussed in this class so far.

Perhaps as a way of opening the conversation, here’s a link that shows tweets on #Ferguson and the temporal “hot spots” that happened around key events. Particularly when live-reporting is occurring online, and I’ve seen a lot of news outlets get facts wrong, Twitter’s communicative power is really being harnessed.



Data Mining Project, Tessa & Min, Part II

To follow up on Min’s post about retrieving data from social media platforms using Apigee, I wanted to report back about the next step in our process–preparing the data for visualization.

We decided to dig in with Instagram and see what we could do with the data returned for images tagged with the word sprezzatura. Using the method described in Min’s post we were able to export data by converting JSON to CSV. However Apigee returns paginated results for Instagram, so we were dealing with individual data dumps for 20 images when the total number of images with this tag is over 60,000. By copying and pasting the “next_url” into the Request URL field on Apigee’s console we were able to move sequentially through the sets of data:

next url

We decided to repeat this process only ten times, since the 3,000 times it would have taken to capture the whole data set seemed excessive…
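The copy-paste loop over “next_url” is straightforward to automate. A hedged Python sketch, with the HTTP call stubbed out as a plain function and a made-up two-page payload standing in for Apigee’s responses:

```python
# Stub of Apigee's paginated Instagram responses: each page carries a
# "data" list and a "pagination" dict whose "next_url" points onward.
pages = {
    "page1": {"data": [{"id": "a", "user": {"username": "x"}}],
              "pagination": {"next_url": "page2"}},
    "page2": {"data": [{"id": "b", "user": {"username": "y"}}],
              "pagination": {}},
}

def fetch(url):
    # In practice this would be an HTTP GET against the console's Request URL.
    return pages[url]

rows, url = [], "page1"
while url:
    payload = fetch(url)
    for item in payload["data"]:
        rows.append({"id": item["id"], "username": item["user"]["username"]})
    # A missing next_url ends the loop (the last page has no successor).
    url = payload["pagination"].get("next_url")

print(len(rows))  # -> 2
```

Capped with a page counter, the same loop could stop after ten pages, or run all 3,000 if patience (and rate limits) allowed.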

When we opened the CSV files in Excel we encountered another problem. The format of the data is dictated by the number of tags, comments, likes, etc., meaning that compiling the individual data dumps into one useful Excel file was tricky.

We compiled just the header information to try to make sense of it:

Screen Shot 2014-11-18 at 1.10.01 AM

The 5th row indicates a data dump that contained a video. As a result additional columns were added to account for the video data. At first we thought that cleaning up the data from the 10 data dumps would just be a matter of adjusting for extra columns and moving the matching columns into alignment, but as we dug deeper into our data we realized that that wouldn’t work:

Screen Shot 2014-11-18 at 1.05.30 AM

As you can see, some of the data dumps go from reporting tags to a column for “type” followed by location data, while others go directly into reporting comment data. The same data is available in each data dump, but inexplicably they are not all returned in the same order.

We looked into a few options for merging Excel spreadsheets based on column headers, but either the programs weren’t Mac-friendly or the merge seemed to do more harm than good. We decided to move ahead with cleaning up the data in a more or less manual way with good old fashioned copying and pasting. We wanted to look at location data on these images (perhaps #sprezzatura is still most commonly used in Italy or maybe it’s been specifically appropriated by the fashion community in NYC?), so we decided to harvest the data for latitude and longitude. We did this by filtering the columns for latitude in each data dump to return the images that had this data (only about 1/3 of the images had geotagging turned on). We also gathered the username, the link to the image, and the time the image was posted.
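The column-alignment headache has a tidy workaround in code: reading each dump with Python’s csv.DictReader keys every value by its header name, so the order of the columns in each dump stops mattering when they are recombined. A small sketch with invented rows:

```python
import csv, io

# Two dumps with the same fields in different column orders, as in the
# screenshots above.
dump1 = "id,latitude,longitude\n1,40.74,-73.99\n"
dump2 = "longitude,id,latitude\n12.49,2,41.90\n"

rows = []
for dump in (dump1, dump2):
    # DictReader maps each value to its header, regardless of position.
    rows.extend(csv.DictReader(io.StringIO(dump)))

# DictWriter re-emits everything in one fixed column order.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "latitude", "longitude"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

This sidesteps the Mac-unfriendly merge tools entirely, though dumps with genuinely different fields (like the extra video columns) would still need a superset of fieldnames.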

We made a quick map in Tableau, just to see our data realized:

Screen Shot 2014-11-18 at 1.27.34 AM

Next steps are to make a more meaningful visualization around this topic. We’d be interested to try ImagePlot to analyze the images themselves, but we haven’t explored a batch download of Instagram photos yet.

Small Data Viz

I am interested in visualizing information, well… I am also interested in visual information, and visual literacy.

I swear I am creative. and visual. 🙂

I work at Bronx Community College, and I have been trying to engage with the campus history. Bronx Community College used to be NYU at the turn of the century, when NYU constructed a Hall of Fame for Great Americans. It is a stone colonnade, and every few years there was a nationally publicized event to induct people into the Hall of Great Americans: people from all different fields of study, all great for being Americans. There was a democratic-ish nomination process that is VERY well documented, and then the Hall of Fame delegates had the final vote about who got inducted. The last person was inducted in the 1970s, right after the campus was sold by NYU to CUNY to become Bronx Community College. Here is a link to the Google Doc for all the data I collected about each Hall of Fame inductee. I created a tiny data set made of all the information we have in different pamphlets about the Hall of Fame busts.

My students have to write papers about the Hall of Fame and find lots of biographies of the Hall of Fame people, and sometimes they come to the desk and they are like, “Which one is the coolest? I don’t want a lame person.” I thought I could maybe make an infographic that gives you some sort of overview of the Hall of Fame and the inductees. There are quite a few busts in the Hall of Fame that leave the viewer wondering why this person was considered Hall of Fame worthy.

I started by separating the figures by subject.

Then I thought I could build something more complicated. This PDF is supposed to be 2 ft by 3 ft; that’s why you have to zoom in to see stuff. I built it with layers in InDesign.

I like this size of data set because it is manageable, and I think it gives some real information about this historical architectural spot. This historical landmark represents a national identity that was defined mostly before the Second World War. The people who were involved with this landmark were thinking about this country in terms of the Civil War. That just blows my mind. Looking at the life spans of each inductee tells a story about how people at the turn of the century were mapping out this nation’s history and identity. Building this visualization from this little pamphlet data set allows me to speak very assuredly about the history of this landmark.

LOOK HERE FOR MY POSTER: poster infographicidea

Here is the big file: poster infographicidea

it’s “BIG” data to me: data viz part 2

image 1: the final visualization (keep reading, tho)

Preface: the files related to my data visualization exploration can be located on my figshare project page: Digital Humanities Praxis 2014: Data Visualization Fileset.

In the beginning, I thought I had broken the Internet. My original file (of all the artists at the Tate Modern) in Gephi did nothing… my computer fan just spun ’round and ’round until I had to force quit and shut down*. Distraught — remember my beautiful data columns from the last post?! — I gave myself some time away from the project to collect my thoughts, and realized that in my haste to visualize some data! I had forgotten the basics.

Instead of re-inventing the wheel by creating separate Gephi import files for nodes and edges, I went to Table2Net and treated the data set as related citations, since I aimed to create a network after all. To make sure this would work, I created a test file of partial data using only the entries for Joseph Beuys and Claes Oldenburg. I set the uploaded file to have two nodes: one for ‘artist’, the other for ‘yearMade’. The Table2Net file was then imported into Gephi.
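What Table2Net does under the hood is essentially build a weighted edge list, which can also be sketched directly in Python. The rows below are invented, and the column names are assumptions based on the Tate CSV:

```python
import csv, io
from collections import Counter

# Invented artwork rows; 'artist' and 'yearMade' are assumed column names.
artworks = [
    {"artist": "Joseph Beuys", "yearMade": "1972"},
    {"artist": "Joseph Beuys", "yearMade": "1972"},
    {"artist": "Claes Oldenburg", "yearMade": "1966"},
]

# Count co-occurrences: repeated artist-year pairs become edge weight.
edges = Counter((a["artist"], a["yearMade"]) for a in artworks)

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["Source", "Target", "Weight"])  # headers Gephi recognizes
for (artist, year), weight in sorted(edges.items()):
    writer.writerow([artist, year, weight])
print(out.getvalue())
```

Weighting the edges this way is what lets the visualization show one thick line instead of many duplicates, as described below for image 2.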

Screen Shot 2014-11-12 at 9.28.37 PM

Image 2: the first viz, using a partial data set file; a test.

I tinkered with the settings in Gephi a bit — altering the weight of the edges/nodes and the color. I set the layout to Fruchterman-Reingold and voilà, image 2 above!

With renewed confidence, I tried the “BIG” data set again. Table2Net took a little longer to export this time, but eventually it worked and I went through the same workflow as with the Beuys/Oldenburg set. In the end, I got image 3 below (which looks absolutely crazy):

Screen Shot 2014-11-12 at 9.34.07 PM

Image 3: OOPS, too much data, but I’m not giving up.



To image 3’s credit, watching the actual PDF load is amazing: it slowly opens (at least on my computer) and layers each part of the network, which eventually ends up beneath the mass of labels — artist name AND year — that makes up the furry-looking square blob pictured here. You can see the network layering process yourself by going to the figshare file set and downloading this file.

I then knew two things: little data and “BIG” data need to be treated differently, and scale is the reason why. There were approximately 69,000 rows in the “BIG” data set, and only about 600 rows in the little data set. Remember, I weighted the nodes/edges for image 2 so that thicker lines represent more connections, hence there not being 600 connecting lines shown.

Removing labels definitely had to happen next to make the visualization legible, but I wanted to make sure that the data was still representative of its source. To accomplish this, I switched to the ForceAtlas layout and ran it for about 30 seconds. As time passed, the map became more and more similar to my original small data set visualization — with central zones and connectors. Though this final image varies from the original visualization (image 2), the result (image 1) is far more legible.

Screen Shot 2014-11-12 at 9.52.56 PM

Image 4: Running ForceAtlas on what was originally image 3.

My major take-away: it’s super easy to misrepresent data, and documentation is important — to ensure that you can replicate yourself, that others can replicate you, and to ensure that the process isn’t just steps to accomplish a task. The result should be a bonus to the material you’re working with and informative to your audience.

I’m not quite sure what I’m saying yet about the Tate Modern. I’ll get there. Until then, take a look at where I started (if you haven’t already).

*I really need a new computer.

Tomorrow: Data Visualization II Workshop!

Hi folks!

Tomorrow’s Digital Praxis Workshop will be Data Visualization II!

I’m kind of freaking out about it.

This one will really take it to the next level, with an under-the-hood look at creating interactive visualizations using d3.

Interactive designer Sarah Groff-Palermo will demonstrate, explain and walk users through exercises in d3, a JavaScript-based interactive programming framework, and its associated technologies and libraries.

Attendees of the workshop are strongly encouraged to bring their own laptop computers and should have, or create, a github account prior to the session. Note: Sarah works on a Mac – but the workshop will also be accessible and beneficial to users of other operating systems (including the room’s library-provided desktop computers).

Looking forward to seeing the Praxis students there (Mina Rees Library, Concourse level, Room C196.02) tomorrow after the class!


Mona Lisa Selfie: data viz part 1

Image from http://zone.tmz.com/, used with permission for noncommercial re-use (thanks Google search filters)

It took me a long time to get here, but I’ve found a data set that I feel comfortable manipulating, and it has given me an idea that I’m not entirely comfortable with executing, but am enjoying thinking about & exploring.

But before I get to that: my data set. I explored for a long time and, if you’ve read my comments, ran into a lot of trouble with RDF files. All the “cool” data I wanted to use was in RDF, and it turns out RDF is my monopoly road block: do not pass go, do not collect $200. So I kept looking, and eventually found a giant CSV file on Github of the artworks at the Tate Modern, along with another more manageable file of the artist data (name, birth date, death date). But let’s make my computer fan spin and look at that artwork file!

It has 69,202 rows and columns that run out to “T” (that is, 20 columns). Using Ctrl+C, Ctrl+V, and Excel’s text-to-columns feature, I was able to browse the data.

Screen Shot 2014-11-02 at 10.28.02 AM

seemingly jumbled CSV data, imported into Excel

Screen Shot 2014-11-02 at 10.28.14 AM

text to columns is my favorite

Screen Shot 2014-11-02 at 10.28.23 AM

manageable, labelled columns!

I spent a lot of time just browsing the data, as one might browse a museum (see what I did there?). I became intrigued by it in the first place because on my first trip to London this past July, I didn’t actually go to the Tate Modern. My travel companions had already been, and we opted for a pint at The Black Friar instead. So I’m looking at the data blind: even though I am familiar with the artists and can preview the included URLs, I haven’t experienced the place or its artwork on my own. Only the data. As such, I wanted to make sure that any subsequent visualization was as accurately representative as I could manage. I started writing down connections that could possibly interest me, or that would be beneficial as a visualization, such as:

  • mapping accession method — purchased, “bequeathed”, or “presented” — against medium and/or artist
  • evaluating trends in art by looking at medium and year made compared to year acquired
  • a number of people have looked at the gender make-up of the artists, so skip that for now
  • volume of works per artist, and volume of works per medium and size
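As a first pass at that last bullet, here is a plain-Python sketch of counting works per artist and per accession method with `collections.Counter`. The three records are hypothetical; in practice they would come from the artwork CSV.

```python
from collections import Counter

# Hypothetical records standing in for rows of the artwork CSV.
works = [
    {"artist": "Turner", "medium": "Oil paint on canvas", "acquisition": "Bequeathed"},
    {"artist": "Turner", "medium": "Watercolour on paper", "acquisition": "Bequeathed"},
    {"artist": "Bacon", "medium": "Oil paint on canvas", "acquisition": "Purchased"},
]

# Tally volume of works per artist and per accession method.
works_per_artist = Counter(w["artist"] for w in works)
works_per_method = Counter(w["acquisition"] for w in works)

print(works_per_artist.most_common())  # [('Turner', 2), ('Bacon', 1)]
print(works_per_method["Bequeathed"])  # 2
```

The same pattern would work for medium, year acquired, or any other column, and the resulting counts drop straight into a bar chart in Excel or Tableau.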

But then I started thinking about altmetrics, again — using social media to track new forms of use & citation in (academic) conversations.

Backtrack: Last week I took a jaunt to the Metropolitan Museum of Art and did a tourist a favor by taking a picture of her and her friend next to a very large photograph. We were promptly yelled at. Such a sight is common in modern-day museums, most notably of late with Beyoncé and Jay Z.

What if there were a way to use art data to connect in-person museum visitors to a portable 1) image of the work and 2) information about the work? Unfortunately, the only way I can think to make this happen would be via QR code, which I hate (for no good reason). But then, do visitors really want a link or a saved image? The point of visiting a museum is the experience, and this idea seems too far removed from it.

What if there were a way to falsify a selfie, to get the in-person experience without being yelled at by men in maroon coats? This would likely take the form of an app, and again, QR codes might need to play a role, as well as a lot of development that I don’t feel quite up for. The visitor would now be interacting with the art, and the institution could collect usage data to track artwork popularity, which could inform future acquisitions or programs.

Though it’s a bit tangential to the data visualization project, this is my slightly uncomfortable idea developed in the process. I’d love thoughts or feedback, or someone to tell me to pump the proverbial brakes. I’ll be back in a day or so with something about the visualizations I’ve dreamed up in the bullets above. My home computer really can’t handle Gephi right now.