Tag Archives: Data Analysis

Tandem Week 13 Update

WEEK 13 TANDEM PROJECT UPDATE:

We are happy to announce that the initial version of our near-polished UI is up and running at http://dhtandem.com/. This means that you can now go to the site, walk through uploading files, and review some early versions of our documentation.

Immediate next steps for our team include updating the text on the documentation pages with the more robust versions we have patiently waiting in the wings while we finalize the connection between the front and back ends of the app. We have been powering away at creating thorough documentation and user information for the final site. This also includes our exploration of the Mother Goose corpus, which is beginning to take shape (in part thanks to some TANDEM supporters and volunteers from the praxis class). Basically, we’re pushing our data set through various tools for discovery and analysis. The results will be incorporated into the Sample Data section of the TANDEM website, which is intended as an example of the app’s potential and as a learning tool for new users.

As we continue to work on bugs and high-priority action items, such as fixing an error with zipping files that originated from a change in processing in this iteration, we are identifying areas that could use strengthening post-dhpraxis. Our functional May 19th MVP is so close we can taste it.

The zipping problem mentioned above may be related to another problem, which only happens on the server and cannot be replicated in a development environment. What appears to happen is as follows: when a user starts a new project, TANDEM builds three folders on the server: one for the uploaded files, one for the final output (which is subsequently zipped for download), and a third that serves as a staging or intermediate directory containing files produced by any pre-processing that is required. For example, PDF files must be converted to JPG for our image analysis software to work, and the text must be extracted into TXT files via an OCR step for NLTK to be able to consume the content.
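For readers curious about the plumbing, here is a rough sketch of that per-project layout in Python. The folder names and helper functions are illustrative stand-ins, not TANDEM’s actual code.

```python
import os
import zipfile

# Illustrative sketch of the three-folder layout described above; folder names
# and helpers are stand-ins, not TANDEM's actual code.

def create_project_dirs(base_dir, project_id):
    """Create the upload, staging, and output folders for a new project."""
    paths = {
        "upload": os.path.join(base_dir, project_id, "upload"),    # raw user files
        "staging": os.path.join(base_dir, project_id, "staging"),  # pre-processed files (PDF->JPG, OCR'd TXT)
        "output": os.path.join(base_dir, project_id, "output"),    # final results, zipped for download
    }
    for path in paths.values():
        os.makedirs(path, exist_ok=True)
    return paths

def zip_output(output_dir, zip_path):
    """Bundle the output folder into a single zip for the user to download."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(output_dir):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, arcname=os.path.relpath(full, output_dir))
```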

These new folders appear to be created successfully, and their locations are saved to global variables in the program. However, when it comes time to write files to the newly created folders, it seems that the files are being written to a previously used set of folders. The problem is intermittent. To make diagnosis more difficult, the zip step sometimes zips the older folder, which delivers content from multiple projects to the user. Other times the zip step zips the new folder, which is empty, delivering an empty file to the user. At still other times, the files are all read and written properly.
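To illustrate why the symptoms point at those globals, here is a deliberately simplified, entirely hypothetical sketch: a module-level variable holding “the current project’s folder” can be overwritten by a second request before the first finishes, so files land in someone else’s directories. Passing each project’s paths explicitly avoids the problem.

```python
import os

# Hypothetical sketch, not TANDEM's code: a shared global holding the "current"
# output folder goes stale as soon as another request updates it.

CURRENT_OUTPUT_DIR = None

def start_project(output_dir):
    global CURRENT_OUTPUT_DIR
    CURRENT_OUTPUT_DIR = output_dir  # request B can overwrite this while request A is still running

def write_result_via_global(filename, data):
    # May write into whichever project most recently called start_project().
    with open(os.path.join(CURRENT_OUTPUT_DIR, filename), "w") as f:
        f.write(data)

def write_result_explicit(output_dir, filename, data):
    # Safer pattern: every call carries its own project folder.
    with open(os.path.join(output_dir, filename), "w") as f:
        f.write(data)
```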

Zipping issues aside, we are moving along. Given all the amazing progress we have made, it is not surprising that buzz for the launch is growing. (Also Jojo invites anyone and everyone she speaks to). With new details regarding presentations, we are ready to get this party started. The DH community at CUNY and in New York has been a part of these projects whether actively or abstractly, and it seems a grand opportunity to celebrate.

 

Part Two, Mapping the Icelandic Outlaw Sagas Narratively

Dear digitalists,

In my last post, I shared a rather lengthy write-up of a geospatial data project I’ve been working on–I hope that some of it is helpful!

Aiming for brevity in this post (and apologies for hogging the blog), I’d like to see if anyone has feedback for part two of the mapping project I’m working on currently. To summarize the project in brief, borrowing directly from my last post: “Driven by my research interests in the spatiality of imaginative reading environments and their potential lived analogues, I set out to create a map of the Icelandic outlaw sagas that could account for their geospatial and narrative dimensions.”

While you can check out those aforementioned geospatial dimensions here, the current visualization I’ve created for those narrative dimensions seems to be lacking. Here it is, and let me describe what I have so far:

Click through for interactive Sigma map

I used metadata from my original XML document, focusing on categories for types of literary or semantic usage of place names in the sagas. I broadly coded each mention of a place name in the three outlaw sagas for what “work” it seemed to be doing in the text, using the following categories: declarative (Grettir went to Bjarg), possessive (which included geographic features that were not necessarily place names but acted as one through the possessive mode, such as Grettir’s farm), affiliation (Grettir from Bjarg), and whether the place name appeared in prose, poetry, or an embedded speech. Using the open-source software Gephi, I transformed this metadata into nodes and edges, then arranged them with a force algorithm according to a place-name weight that accounted for frequency of mentions across the sagas. I used the JavaScript library Sigma to embed the Gephi map in the browser.
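For anyone who wants to try something similar, here is a minimal sketch of how coded place-name mentions might be flattened into the node and edge CSVs that Gephi imports, with node weights taken from mention frequency. The file and column names are placeholders, not my exact pipeline.

```python
import csv
from collections import Counter

# Placeholder file and column names: build Gephi node and edge lists from
# coded place-name mentions, weighting each place node by mention frequency.

with open("placename_mentions.csv", newline="", encoding="utf-8") as f:
    mentions = list(csv.DictReader(f))  # rows: saga, place_name, usage, mode

weights = Counter(row["place_name"] for row in mentions)

with open("nodes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Label", "Weight"])
    for place, count in weights.items():
        writer.writerow([place, place, count])             # weight = frequency across the sagas
    for saga in sorted({row["saga"] for row in mentions}):
        writer.writerow([saga, saga, ""])                  # saga nodes, left unweighted

with open("edges.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Type"])
    for row in mentions:
        writer.writerow([row["saga"], row["place_name"], "Undirected"])
```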

While I feel that this network offers a greater degree of granularity on uses of place name, right now I feel also that it has two major weaknesses: 1) it does not interact with the geographic map, and 2) I am not sure how well it captures place name’s use within the narrative itself.

My question to you, fellow digitalists: what are ways that I could really demonstrate how place names function within a narrative? Should I account for narrative’s temporal aspect–the fact that time passes as the narrative unfolds, giving a particular shape to the experience of reading that place names might inform geographically? How could I get an overlay, of sorts, on the geospatial map itself? Should I consider topic modelling, text mining? Are there potential positive aspects of this Gephi work that might be worth exploring further?

Submitting to you, dear readers, with enormous debts of gratitude in advance for your help! And even if you don’t consider yourself a literary expert–please chime in. We all read, and that experience of how potentially geographic elements affect us as readers and create meaning through storytelling is my most essential question.

Mapping the Icelandic Outlaw Sagas

Greetings, fellow digitalists,

*warning: long read*

I am so impressed with everyone’s projects! I feel like I blinked on this blog, and all of a sudden everything started happening! I’ve tried to go back and comment on all your work so far–let me know if I’ve missed anything. Once again, truly grateful for your inspiring work.

Now that it’s my turn: I’d like to share a project that I’ve been working on for the past year or so. I’ll break it down into two blog posts–one where I discuss the first part, and the other that requests your assistance for the part I’m still working on.

A year ago, I received funding from Columbia Libraries Digital Centers Internship Program to work in their Digital Social Sciences Center on a digital project of my own choosing. I’ve always gravitated towards the medieval North Atlantic, particularly with anything dark, brooding, and scattered with thorns and eths (these fabulous letters didn’t make it from Old English and Old Norse into our modern alphabet: Þ, þ are thorn, and Ð, ð are eth). Driven by my research interests in the spatiality of imaginative reading environments and their potential lived analogues, I set out to create a map of the Icelandic outlaw sagas that could account for their geospatial and narrative dimensions.

Since you all have been so wonderfully transparent in your documentation of your process, to do so discursively for a moment: the neat little sentence that ends the above paragraph has been almost a year in the making! The process of creating this digital project was messy, and it was a constant quest to revise, clarify, research, and streamline. You can read more about this process here and here and here, to see the gear shifts, epic flubs, and general messiness this project entails.

But, to keep with this theme of documentation as a means of controlling data’s chaotic properties, I’ve decided to thematically break down this blog post into elements of the project’s documentation. Since we’ve already had some excellent posts on Gephi and data visualization, I’ll only briefly cover that part of my project towards the end–look for more details on that part two in another blog post, like I mention above.

As a final, brief preface: some of these sections have been borrowed from my actual codebook that I submitted in completion of this project this past summer, and some parts are from an article draft I’m writing on this topic–but the bulk of what you’ll read below are my class-specific thoughts on data work and my process. You’ll see the section in header font, and the explanation below. Ready?

Introduction to the Dataset

The intention of this project was to collect data on place names in literature in order to visualize and analyze them from a geographic as well as a literary perspective. I digitized and encoded three of the Icelandic Sagas, or Íslendingasögur, related to outlaws from the thirteenth and fourteenth centuries, titled Grettis saga (Grettir’s Saga), Gísla saga Súrssonar (Gisli’s Saga), and Hardar Saga og Hölmverja (The Saga of Hord and the People of Holm). I then collected geospatial data through internet sources (rather than fieldwork, although this would be a valuable future component) at the Data Service of the Digital Social Sciences Center of Columbia Libraries, during the timeframe of September 17th, 2013, to June 14th, 2014. Additionally, as part of my documentation of this data set, I had to record all of the hardware, software, and JavaScript libraries I used–this, along with the dates noted above, will allow my research to be reproduced and verified.

Data Sources

Part of the reason I wanted to work with medieval texts is their open-source status; stories from the Íslendingasögur are not under copyright in their original Old Norse or in most 18th and 19th century translations and many are available online. However, since this project’s time span was only a year, I didn’t want to spend time laboriously translating Old Norse when the place names I was extracting from the sagas would be almost identical in translation. With this in mind, I used the most recent and definitive English translations of the sagas to encode place name mentions, and cross-referenced place names with the original Old Norse when searching for their geospatial data (The Complete Sagas of Icelanders, including 49 Tales. ed. Viðar Hreinsson. Reykjavík: Leifur Eiríksson Pub., 1997).

Universe

When I encountered this section of my documentation (not as a data scientist, but as a student of literature), it took me a while to consider what it meant. I’ll be using the concept of “data’s universe,” or the scope of the data sample, as the fulcrum for many of the theoretical questions that have accompanied this project, so prepare yourself to dive into some discipline-specific prose!

On the one hand, the universe of the data is the literary world of the Icelandic Sagas, a body of literature from the 13th and 14th centuries in medieval Iceland. Over the centuries, they have been transmuted from manuscript form, to transcription, to translation in print, and finally to digital documents—the latter of which has been used in this project as sole textual reference. Given the manifold nature of their material textual presence—and indeed, the manuscript variations and variety of textual editions of each saga—we cannot pinpoint the literary universe to a particular stage of materiality, since to privilege one form would exclude another. Seemingly, my data universe would be the imaginative and conceptual world of the sagas as seen in their status as literary works.

A manuscript image from Njáls saga in the Möðruvallabók (AM 132 folio 13r) circa 1350, via Wikipedia

However, this does not account for the geospatial element of this project, or the potential real connections between lived experience and the geographic spaces that the sagas depict. Shouldn’t the data universe accommodate Iceland’s geography, too? The act of treating literary spaces geographically, however, is a little fraught: in this project, I had to negotiate specifically the idea that mapping the sagas is at all possible, from a literary perspective. In the latter half of the twentieth century, scholars considered the sagas primarily as literary texts, full of suspicious monsters and other non-veracities, which from this perspective could not possibly be historical. The idea of mapping the sagas would thus have been irrelevant according to this logic, since seemingly factual inconsistencies undermined the historical, and thus geographic, possibilities of the sagas at every interpretive turn.

However, interdisciplinary efforts have increasingly challenged this dismissive assumption. Everything from studies on pollen that confirm the environmental difficulties described in the sagas, to computational studies showing that the social patterns represented in the Icelandic sagas are quantitatively remarkably similar to genuine social relationships, suggests that the sagas are concerned with the environment and geography that surrounded their literary production.

But even if we can create a map of these sagas, how can we counter the critique that mapping promotes an artificial “flattening” of place, removing the complexity of ideas by stripping them down to geospatial points? Our course text, Hypercities, speaks to this challenge by proposing the creation of “deep maps” that account for temporal, cultural, and emotional dimensions that inform the production of space. I wanted to preserve the idea of the “deep map” in my geospatial project on the Icelandic Sagas, so in a classic etymological DH move (shout out to our founding father Busa, as ever), I tried to trace where the idea of “deep mapping” originated before Hypercities, which only came out this year yet represents a far earlier concept.

I traced the term “deep mapping” back to William Least Heat-Moon, who coined the phrase in the title of his book, PrairyErth (A Deep Map), to indicate the “detailed describing of place that can only occur in narrative” (Mendelson, Donna. 1999. “‘Transparent Overlay Maps’: Layers of Place Knowledge in Human Geography and Ecocriticism.” Interdisciplinary Literary Studies 1:1. p. 81). According to this definition, “deep maps” occur primarily in narrative, creating depictions of places that may be mapped on a geographic grid, though the grid can never truly account for the density of experience that occurs in these sites. Heat-Moon’s use of the phrase, however, does not preclude earlier representations of the concept; the use of narratives that explore particular geographies is as old as the technology of writing. In fact, according to Heat-Moon’s conception of deep mapping, we might consider the medieval Icelandic sagas a deep map in their detailed portrayal of places, landscape, and the environment in post-settlement Iceland. Often occurring around the locus of a few regional farmsteads, the sagas describe minute aspects of daily Icelandic life, including staking claim to beached whales under driftage rights, traveling to Althing (now Thingvellir) for annual law symposiums, and traversing valleys on horseback to seek out supernatural foes for battle. Adhering to a narrative form not seen again until the rise of the novel in the 18th century, the Íslendingasögur are a unique medieval exemplum of Heat-Moon’s concept of deep mapping and of the geographic complexity that results from narrative. Thus, a ‘deep map’ may include not only a narrative, such as the sagas’ plots, but potentially also a geographic map for the superimposition of knowledge upon it–allowing these layers of meaning to build and generate new meaning.

To tighten the data universe a little more: specifically within the sagas, I have chosen the outlaw-themed sagas for their shared thematic investment in place names and geography. Given that much of the center of Iceland today consists of glaciers and wasteland, outlaws had precious few options for survival once pushed to the margins of their society. Thus, geographic aspects of place name seem to be just as essential to the narrative of sagas as their more literary qualities—such as how they are used in sentences, or what place names are used to obscure or reveal.

Map of Iceland, by Isaac de La Peyrère, Amsterdam, 1715. via Cornell University Library, Icelandica Collection

In many ways, the question of “universe” for my data is the crux of my research question itself: how do we account for the different intersections of universes—both imaginative and literary, as well as geographic and historical—within our unit of analysis?

Unit of Analysis

If we dissect the element that allows geospatial and literary forms to interact, we arrive at place names. Place names are a natural site for this tension between literary and geographic place, since they exist in the one medium shared by these two modes of representation: language. In their linguistic as well as geographic connotations, place names function as the main point of connection between geographic and narrative depictions of space, and it is upon this argument that this project uses place name as its unit of analysis.

Methodology

Alright, now that we’re out of the figurative woods, on to the data itself. Here are the steps I used to create a geospatial map with metadata for these saga place names.

Data Collection, Place Names:

The print text was scanned and digitized using ABBYY FineReader 11.0, which performs Optical Character Recognition to ensure PDFs are readable (or “optional character recognition,” as I like to say), and converted to an XML file. I then used the flexible coding format of XML to hand-encode place-name mentions using the TEI protocol and a custom schema. In the XML file, names were cleaned up after OCR and standardized according to Anglicized spellings to ensure searchability across the data and for look-up in search engines such as Google–this saved a step in data clean-up once I’d transformed the XML into a CSV.

 

[Image: KinniburghXML]

Here’s the TEI header of the XML document–note that it’s nested tags, just like HTML.

Data Extraction / Cleanup

In order to extract data, the XML document was saved as a CSV. Literally, “File > Save As.” This is a huge benefit of using flexible mark-up like XML, as opposed to annotation software that can be difficult to extract data from, such as NVivo, which I wrote about here on Columbia University Library’s blog in a description of my methodology. In the original raw, uncleaned file, columns represented encoding variables, and rows represented the encoded text. I cleaned the data to eliminate column redundancies and extraneous blank spaces, as well as to preserve the following variables: place name, chapter and saga of the place name, type of place-name usage, and place-name presence in poetry, prose, or speech. I also re-checked my spelling here–next time, no hand-encoding!

 

[Image: KinniburghCSV]

Here’s the CSV file after I cleaned it up (it was a mess at first!)

[Image: KinniburghCSVKey]

I saved individual CSVs, but also kept related info in an Excel document. One sheet, featured here, was a key to all the variables of my columns, so anyone could decipher my data.
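If you would rather script the extraction than rely on “Save As,” here is a small sketch of the same idea in Python. The element and attribute names are placeholders standing in for the custom schema, not my exact markup.

```python
import csv
import xml.etree.ElementTree as ET

# Placeholder element/attribute names, not the project's actual schema: pull
# each encoded placeName out of the TEI XML and flatten it into CSV rows.

TEI = "{http://www.tei-c.org/ns/1.0}"
tree = ET.parse("outlaw_sagas.xml")

rows = []
for place in tree.iter(TEI + "placeName"):
    rows.append({
        "place_name": (place.text or "").strip(),
        "usage": place.get("type", ""),     # e.g. declarative / possessive / affiliation
        "mode": place.get("subtype", ""),   # e.g. prose / poetry / speech
    })

with open("placename_mentions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["place_name", "usage", "mode"])
    writer.writeheader()
    writer.writerows(rows)
```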

Resulting Metadata:

Once extracted, I geocoded place names using the open-source software Quantitative Geographic Information Systems (QGIS), which is remarkably similar to ArcGIS except FREE, and was able to accommodate some of those special Icelandic characters I discussed earlier. The resulting geospatial file is called a shapefile, and QGIS allows you to link the shapefile (containing your geospatial data) with a CSV (which contains your metadata). This feature allowed me to link my geocoded points to their corresponding metadata (the CSV file that I’d created earlier, which had place name, its respective saga, all that good stuff) via a unique ID number.
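For anyone who prefers to script that join, here is a hedged sketch of the same shapefile-plus-CSV link using geopandas; the file names and the shared join_id column are placeholders.

```python
import geopandas as gpd
import pandas as pd

# Placeholder file and column names: join the geocoded shapefile to its
# metadata CSV on a shared unique ID, the same link QGIS performs in its GUI.

points = gpd.read_file("saga_placenames.shp")      # geocoded points, one unique ID per place
metadata = pd.read_csv("placename_metadata.csv")   # place name, saga, usage type, etc.

joined = points.merge(metadata, on="join_id", how="left")
gpd.GeoDataFrame(joined, geometry="geometry").to_file("saga_placenames_joined.shp")
```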

Data Visualization, or THE BIG REVEAL   

While QGIS is powerful and freely accessible, it’s not the most user-friendly software. It takes a little time to learn, and I certainly did not expect that everyone who might want to see my data would also want to learn new software! To that end, I used the JavaScript library Leaflet to create an interactive map online. You can check it out here–notice there’s a sidebar that lets you filter information on what type of geographic feature the place name comprises, and pop-ups appear when you click on a place name so you can see how many times it occurs within the three outlaw sagas. Here’s one for country mentions, too.

 

[Image: KinniburghLeaflet]

Click on this image to get to the link and interact with the map.
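The map itself is written in JavaScript with Leaflet, but if you work in Python, a rough equivalent can be generated with folium, which writes the Leaflet for you. The file and column names below are placeholders.

```python
import folium
import pandas as pd

# Rough Python analogue of the Leaflet map above, using folium; the input file
# and its columns (place_name, lat, lon, mentions) are placeholders.

df = pd.read_csv("placenames_geocoded.csv")

m = folium.Map(location=[65.0, -18.5], zoom_start=6)  # roughly centered on Iceland
for _, row in df.iterrows():
    folium.Marker(
        [row["lat"], row["lon"]],
        popup=f"{row['place_name']}: {row['mentions']} mentions",
    ).add_to(m)

m.save("saga_map.html")
```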

Takeaways

As the process of this documentation highlights, I feel that working with data is most labor-intensive when it comes to positioning the argument you want your data to make. Of course, actually creating the data by encoding texts and geocoding takes a while too, but so much of the labor in data sets in the humanities is intellectual and theoretical. I think this is a huge strength in bringing humanities scholars towards digital methodologies: the techniques that we use to contextualize very complex systems like literature, fashion, history, motherhood, Taylor Swift (trying to get some class shout-outs in here!) all have a LOT to add to the treatment of data in this digital age.

Thank you for taking the time to read this–and please be sure to let me know if you have any questions, or if I can help you in mapping in any way!

In the meantime, stay tuned for another brief blog post, where I’ll solicit your help for the next stage of this project: visualizing the imaginative components of place name as a corollary to this geographic map.

Info Visualization Workshop

It was standing room only in Micki’s info viz workshop on Thursday. To make the demo more interesting, she used a dataset about the class attendees. We all entered our names, school, department, and year in a shared online doc, which became the basis for parts of the demo. We saw how to take text and clean it up for entry into Excel using a text-editing tool, Text Wrangler. Tabs! Tabs are the answer. Data separated by tabs will go into individual cells in Excel, making it easier to manipulate once it’s in there. Tabs>commas, apparently.
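If you would rather script that clean-up than do it by hand in Text Wrangler, a few lines of Python will do the same comma-to-tab conversion (the file names here are just examples).

```python
import csv

# Example only: convert a comma-separated list to tab-separated so each value
# lands in its own Excel cell when pasted or imported.

with open("attendees.csv", newline="", encoding="utf-8") as src, \
     open("attendees.tsv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst, delimiter="\t")
    for row in csv.reader(src):
        writer.writerow(row)
```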

Once the data was in Excel, we saw some basic functions like using the data filter function, making a pivot table, an area graph and a stacked area graph.

After Excel we moved on to Gephi. Unfortunately, none of the participants could get Gephi running on our computers, so we just watched Micki do a demo. Using our class participant data, she showed us the steps to get the data in, how to do some basic things to get a good-looking visualization, and how to play around with different algorithms and options. This was a pretty small dataset with few connections, so to illustrate some of the more complex things Gephi can do, Micki showed us examples from her own work. For me, this was the best part. I think Liam linked to it earlier, but I highly recommend you look at the force-directed graphs section on Quantifying Kissinger.

Stephen brought up the ‘so what?’ factor with regard to Lev Manovich’s visualizations. I thought Micki’s provided a good counterpoint to that, as she explained how certain visualizations made patterns or connections clear—things that might not have been revealed in another type of analysis.

Overall this was a very informative and useful workshop. It gave me courage to go home and play with my data in Gephi in ways that I didn’t feel able to before, and I hope it encouraged others to get started on their own projects.

 

Mapping Data: Workshop 3/3

Hi all,

Just to follow up on Mary Catherine’s post about finding data, I wanted to recap the final session of this workshop series that took place tonight.

The library guide on mapping data (by Margaret Smith) can be found here: http://libguides.gc.cuny.edu/mappingdata

As in the other two workshops, Smith emphasized thinking about who would be keeping this data and why as part of the critical research process. This is especially interesting given the size of these data sets and maps: the person (or corporate entity, NGO, or government agency) hosting the information likely has a very specific reason for doing so.

She brought us through a few examples, from basic mapping sites like the NYT’s “Mapping America,” which draws on 2005–2009 Census Bureau data, to basic mapping applications like Social Explorer (the free edition has limited access, but the GC has bought full access) and the USGS and NASA mapping applications. The guide also includes a few more advanced mapping options, like ArcGIS, but the tool that seemed most useful to me, in the short term anyway, is Google’s Fusion Tables, which allows you to merge data sets that have terms in common. The example Smith used was a set of demographic data (percentage of minority students) organized by town name (towns in Connecticut) and a second data set that defined geographic boundaries for the same set of towns. Fusion Tables then lets you map the demographic data and select various ways to visualize and customize your results.
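The same merge-on-a-shared-column idea can also be sketched outside Fusion Tables; here is a hedged pandas example with made-up file and column names.

```python
import pandas as pd

# Illustrative only, with made-up file and column names: merge demographic data
# and boundary data on the town name they share, as Fusion Tables does.

demographics = pd.read_csv("ct_town_demographics.csv")  # town, pct_minority_students
boundaries = pd.read_csv("ct_town_boundaries.csv")      # town, boundary geometry (e.g. KML/WKT)

merged = demographics.merge(boundaries, on="town", how="inner")
merged.to_csv("ct_towns_merged.csv", index=False)
```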

My main takeaway from this series was that each of these tools is highly particular and unique, and you have to really dig into playing with the individual system before you’ll even know if it is the right tool for your work.

That, and also learn R.

Dataset Project: Who do you listen to?

The basis of my dataset is my iTunes library. I chose this because it was easily accessible, and because I was interested to see what the relationships in it would look like visualized. My 2-person household has only one computer (a rarity these days, it seems), which means that everything on it is shared, including the iTunes library. Between the two of us, we’ve amassed a pretty big collection of music in digital format. (Our physical, non-digital music collection is also merged, but it is significantly larger than the digital one and only a portion of it is cataloged, so I didn’t want to attempt anything with it.)

I used an AppleScript to extract a list of artists as a text file, which I then put into Excel. I thought about mapping artists to each other through shared members, projects, labels, producers, etc., but after looking at the list of over 2,000 artists (small by Lev Manovich standards!), I decided that while interesting, this would be too time-consuming.

My other idea was easier to implement: mapping our musical affinities. After cleaning up duplicates, I was left with a list of around 1,940 artists. Both of us then went through the list and indicated whether we listened to that artist/band/project, and gave a weight to the relationship on a scale of one to five (1=meh and 5=essential). It looked like this:

 

[Image: Screen shot 2014-10-27 at 9.57.03 PM]

Sample of the data file
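For the curious, here is roughly how a table like that could be flattened into a Gephi edge list. The column names and listener initials are placeholders, not my exact spreadsheet.

```python
import csv

# Placeholder column names and listener initials: write one edge per
# listener/artist pair, carrying the 1-5 weight from the spreadsheet.

listeners = ["R", "J"]

with open("artists_weighted.csv", newline="", encoding="utf-8") as src, \
     open("edges.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    writer.writerow(["Source", "Target", "Weight"])
    for row in csv.DictReader(src):
        for listener in listeners:
            weight = row.get(listener, "").strip()
            if weight:  # a blank cell means that person doesn't listen to this artist
                writer.writerow([listener, row["artist"], weight])
```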

Ultimately I identified 528 artists and J identified 899. An interesting note about this process: after we both went through the list, there were approximately 500 artists that neither of us acknowledged. Some of this can be chalked up to people on compilations we might not be that familiar with individually. The rest…who knows?

Once this was done I put the data into Gephi. At the end of my last post I was having trouble with this process. After some trial and error, I figured it out. It was a lot like the process I used with the .gdf files in my last post. The steps were: save the Excel file as CSV and uncheck “append file extension,” then open that file in TextEdit, save it in UTF-8 format, AND change the file extension to .csv. Gephi took this file with no problems.
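That TextEdit detour can also be scripted. Here is a tiny sketch that re-saves the export as UTF-8 with a .csv extension; the source encoding is a guess, since Excel on a Mac often exports in a legacy encoding.

```python
# Tiny sketch: re-encode the Excel export as UTF-8 and give it a .csv name.
# The source encoding ("mac_roman") is an assumption about the Excel export.

with open("artists_export", encoding="mac_roman") as src:
    text = src.read()

with open("artists_export_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(text)
```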

The troublesome process of coding and preparing the data for analysis done, it was time for the fun stuff. As with my last visualization, I used the steps from the basic tutorial to create a ForceAtlas layout graph. Here it is without labels:

[Image: w_I_u_2]

The weight assigned to each relationship is shown in the distance from our individual nodes, and also in the thickness of the edge (line) that connects the nodes. It can be hard to see without zooming in closely on the image, since with so many edges it is kind of noisy.

Overall, I like the visualization. It doesn’t offer any new information, but it accurately reflects the data I had. Once I had the trouble spots in the process worked out, it went pretty smoothly.

I am not sure ForceAtlas is the best layout for this information. I will look into other layout options and play around with them to see if it looks better or worse.

I made an image with the nodes labeled, but it became too much to look at as a static image. To this end, I want to work on using Sigma (thanks to Mary Catherine for the tip!) to make the graph interactive, which would enable easier viewing of the relationships and the node labels, especially the weights. This may be way beyond my current skill level, but I’m going to give it a go.

ETA: the above image is a JPEG; here is a PDF to download if you want better zoom options: w_I_u_2

Finding Data: Preliminary Questions

Hello, all,

As promised, here’s a link to the “Finding Data” library guide on the Mina Rees Library site. Apologies if someone has posted it already!

http://libguides.gc.cuny.edu/findingdata

The guide was created by the wonderful Margaret Smith, an adjunct librarian at the GC Library who is teaching the workshops on data for social research. There’s one more–Wednesday, 6:30-8:30pm downstairs in the library in one of the computer labs–and I’m sure she’d be happy to have anyone swing by. Check out the Library’s blog for details.

Within this guide, the starting questions that Smith provides, in order to get you thinking of your dataset theoretically as well as practically, are very helpful–and I wish I had them years ago! Here are some highlights, taken directly from the guide (but you should really click through!):

HOW TO FIND DATA:

When searching for data, ask yourself these questions…

Who has an interest in collecting this data?

  • If federal/state/local agencies or non-governmental organizations, try locating their website and looking for a section on research or data.
  • If social science researchers, try searching ICPSR.

What literature has been written that might reference this data?

  • Search a library database or Google Scholar to find articles that may have used the data you’re looking for. Then, consult their bibliographies for the specific name of the data set and who collected it.

HOW TO CONSIDER USING IT:

Is the data…

  • From a reliable source? Who collected it and how?
  • Available to the public? Will I need to request permission to use it? Are there any terms of use? How do I cite the data?
  • In a format I can use for analysis or mapping? Will it require any file conversion or editing before I can use it?
  • Comparable to other data I’m using (if any)? What is the unit of analysis? What is the time scale and geography? Will I need to recode any variables?

And another thought that I really loved from her first workshop in this series:

Consider data as an argument.

Since data is social, what factors go into its production? What questions does the data ask? And how do the answers to these questions, as well as the questions above, affect the ways in which that dataset can shed light on your research questions?

All fantastic stuff–looking forward to seeing more of these data inquiries as they pop up on the blog!

(again, all bulleted text is from the “Finding Data” Lib Guide, by Meg Smith, Last Updated Oct. 15 2014. http://libguides.gc.cuny.edu/findingdata)

MC

Dataset Project: Testing Gephi

I found the projects on the Visual Complexity site really beautiful and interesting, and I was inspired to start playing with Gephi in anticipation of using it for my dataset project.

 

I’m happy I started early! I downloaded the most recent version of Gephi and went through the tutorial using the Les Miserables sample dataset with no problems. I figured since that was so easy, I’d go ahead and visualize my Facebook network, just for fun.

 

I used Netvizz to extract my FB data. I immediately ran into problems getting the data into a format Gephi could read. Netvizz says to ‘right click, save as’, which wasn’t actually an option. Ultimately I opened the .gdf data in the browser, cut and pasted it into an Excel file to save as a CSV, and also pasted the same data into a text file and saved that. The Excel CSV data would load into Gephi, but the IDs and labels were all wonky, and the graph was clearly a mess, with number strings as node labels. I then tried the text file, which threw up an error message and wouldn’t even open. Some amount of Googling and trial and error later, I discovered I had to change the encoding to UTF-8 and change the file extension from .txt to .gdf.

 

Once that was sorted, I had trouble displaying the data in ‘data laboratory’ view of Gephi. I eventually discovered that Macs (or maybe just my really old Mac) are not, for some reason, entirely compatible with the current version of Gephi. OK. Uninstall the current version, re-install an older version. Fortunately that solved that particular problem.

 

So! Eventually I was able to get the data to open properly, and the graph to start looking like it should. I used the same steps from the tutorial to create a graph of my FB network. This part was easy, as it had been with the sample dataset–just following the instructions for a really basic visualization. Beautiful.

 

[Image: FB_noname_viz]

 

Feeling emboldened by my problem solving and subsequent success, I started to play with a small part of my (anticipated) dataset. Several hours (and one trip to the grocery store) later, I’ve not been able to get the data into the proper format so that 1) Gephi will take it, and 2) it will display connections correctly. I can get one or the other, but not both.

 

This will definitely take more research & finessing, but I’m hopeful that I will be able to get it all to work. Stay tuned for the scintillating conclusion!

And so it begins.

In May of 2013, I graduated from Queens College. I spent a small fortune, pennies compared to most, to receive a piece of paper that gave validity to my ramblings about how cool I thought Chaucer was/is. I got a degree in English. I must be crazy. Shortly after, I got a job at a magazine. The problem was that it wasn’t in the editorial department.

That fall, I began working as a Digital Media Planner at a decent-sized publishing company. The reality of digital publication soon came to light. It was my responsibility to develop online advertising strategies for blue-chip brands looking to hit wealthy middle-aged men and women or intelligent millennial thought-starters across our sites.

What is the exact frequency and quantity of annoying flashing advertisements we can throw at users before they stop coming back?

Moreover, how much money can we make off the millions of branded pictures, animations, and videos we position next to our content? We are only shooting for a click-through rate of 0.05% (about 5 out of every 10,000 ads).

Needless to say, within a few short months I began to feel anxious about the amount of information ad servers can gather on users’ online habits. While shopping for data management platforms for our sites, we heard promises of user-profile optimization that would create content and ad experiences specific to a particular person. My web is different from your web.

Omnichannel personalization. Behavioral data. Interest profiles. Purchase histories.

Suddenly, I was hyper aware of how fabricated the thin veneer of the web really was. Most of us interact with a variety of publications on a daily basis, hitting a top tier of social networks along the way, and reading highly curated content that caters to our need to digest quickly and move on.

What I also realized was the power of the data behind this scheme. Entire industries are built on gathering, sourcing, organizing, analyzing, and visualizing these enormous pools of information.

It is much easier to sell a brand an ad campaign when they know their target demographic is exactly who is going to see it.

I began falling in love with data. Call it a complex, but it was essentially a game to see whether I could make these half-million-dollar campaigns work. I spent hours analyzing user flows, traffic rates, and article statistics.

Visualizations began to tell stories. Charts and infographics were as nuanced as poetry.

I hated the job, but I loved the data.

I come to digital humanities with this love of data and a degree in literature. Though they are independently rooted, I hope to unite these two spheres and find common ground this year.

And so it begins.