Tag Archives: Data Project

“Playing” with Images

I was very taken with Lev Manovich’s article, “How to Compare One Million Images?”, on image visualization, which dealt with ImagePlot and its use in his project, although at that time I wasn’t thinking of using it for the dataset “play” project. I am a visually driven person, and spend quite a bit of time playing around with images. Similar to those who relax with books, I curl up with images, and spend a lot of time gazing at pictures, and an almost equal amount of time searching for them. So, with my new-found awareness of data, I began wondering: could my preferences be quantified, and could the resulting measures be used as search criteria?

So with the dataset project in mind, I went back to Manovich’s article and read it again to get details, which directed me to the Software Studies website to download ImageJ. I then downloaded the macro, ImagePlot, required for image visualization. After installing it in ImageJ, I set about finding its requirements for visualization from the software documentation. All that ImagePlot required was an image collection with associated metadata. I put together a set of 135 images from my personal collection after sifting through 600 odd images. I took particular care to include only those that I really liked, so the results would be meaningful.

As ImagePlot automatically scales the images to a uniform size, it was enough to just pull all the pictures together into a single folder. (ImagePlot documentation does mention that such a step is not required, as it is capable of handling images stored at different locations in a computer.)

Now that I had the image set in place, I went back to the documentation to find out what format was required of the metadata, which happened to be ‘delimited tab text’. At first, assuming the metadata had to be manually assembled, I spent some time creating a trial file for 20 images in that format. Once it became apparent this would be time consuming, I went back to the documentation and learned that ImageJ does ‘batch’ image processing and measuring (measuring multiple images in one step), the results of which are stored as a .csv file by default. Just choose the features that are to be measured (image brightness, gray values, etc.), click on ‘measure’ and, in one stroke, metadata appropriate for the image visualization is created by ImageJ itself! Overjoyed and very appreciative of ImageJ, I proceeded to convert this .csv file to the ‘delimited tab’ .txt format in Excel and was finally all set to go.

Snapshot of Metadata in Delimited Tab .txt format
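For anyone who would rather script the .csv to tab-delimited conversion than go through Excel, here is a minimal sketch in Python. It is only an illustration: the column names and values in the sample are invented, and a real ImageJ results file may be laid out differently.

```python
import csv
import io


def csv_to_tab_delimited(csv_text: str) -> str:
    """Convert comma-separated text into tab-delimited text."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Invented sample standing in for a measurements file (assumption).
sample = "filename,mean,stdev\nimg001.jpg,112.4,40.2\nimg002.jpg,87.9,35.1\n"
print(csv_to_tab_delimited(sample))
```

To work on files instead of strings, the same reader and writer can be pointed at files opened with `open(path, newline="")`.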

I chose to measure mean gray value (y-axis) and intensity (x-axis) of the images and plotted the values with the following results.

Through the visualization, I was able to see the range of gray values and intensity my images possessed. It seems I prefer images that are bright with less grayness, and of moderate intensity. Most of the images are of medium to low gray values, with very few in the high gray and high intensity category. The lines link images of similar characteristics and show how the images relate to each other.

As a next step, I intend to pursue animated visualizations now that I’m familiar with the visualization process. The biggest revelation for me was the documentation that accompanied the software. I’d always assumed that answers had to be found elsewhere from knowledgeable users, but most of my questions were answered by the documentation itself. The worked-out sample projects that accompanied the software were helpful as well. These resources gave me the confidence to approach the project and fix errors in processing. Understanding data formats and creating metadata for the images were equally empowering.

So, going back to my earlier question – can my image preferences be quantified? Yes. But I have yet to figure out how to use these values as search criteria for image collections. That is where I go from here.
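One possible way to turn such measurements into search criteria, sketched here purely as an assumption and not as anything ImagePlot or ImageJ offers, is to take the range that the liked images fall into and keep only candidate images whose measured value lands inside it. The file names and gray values below are invented.

```python
import statistics


def preferred_range(values, width=1.0):
    """Return (low, high): the mean of `values` plus or minus `width` standard deviations."""
    center = statistics.mean(values)
    spread = statistics.pstdev(values)
    return center - width * spread, center + width * spread


def matches(candidates, low, high):
    """Return the names of candidates whose measured value lies within [low, high]."""
    return [name for name, value in candidates.items() if low <= value <= high]


# Invented mean gray values for a liked set and a set of candidates (assumption).
liked = [100.0, 110.0, 120.0]
candidates = {"a.jpg": 105.0, "b.jpg": 200.0, "c.jpg": 118.0}

low, high = preferred_range(liked)
print(matches(candidates, low, high))
```

A single measure is a crude filter; combining two or more (say, gray value and intensity) would simply mean applying the same range test to each.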

ImagePlot and ImagePlot Documentation can be found here –

http://lab.softwarestudies.com/p/imageplot.html

https://docs.google.com/document/d/1zkeik0v2LJmi1TOK4OxT7dVKJO7oCmx_fNP8SYdTG-U/edit?hl=en_US&pli=1#

Part Two, Mapping the Icelandic Outlaw Sagas Narratively

Dear digitalists,

In my last post, I shared a rather lengthy write-up of a geospatial data project I’ve been working on–I hope that some of it is helpful!

Aiming for brevity in this post (and apologies for hogging the blog), I’d like to see if anyone has feedback for part two of the mapping project I’m working on currently. To summarize the project in brief, borrowing directly from my last post: “Driven by my research interests in the spatiality of imaginative reading environments and their potential lived analogues, I set out to create a map of the Icelandic outlaw sagas that could account for their geospatial and narrative dimensions.”

While you can check out those aforementioned geospatial dimensions here, the current visualization I’ve created for those narrative dimensions seems to be lacking. Here it is, and let me describe what I have so far:

Click through for interactive Sigma map

I used metadata from my original XML document, focusing on categories for types of literary or semantic usage of place name in the sagas. I broadly coded each mention of place name in the three outlaw sagas for what “work” it seemed to be doing in the text, featuring the following categories: declarative (Grettir went to Bjarg), possessive (which included geographic features that were not necessarily a place name, but acting as one through the possessive mode, such as Grettir’s farm), affiliation (Grettir from Bjarg) and whether the place name appeared in prose, poetry, or an embedded speech. Using the open-source software Gephi, I transformed this metadata into nodes and edges, then arranged them with a force-directed layout according to a place name weight that accounted for frequency of mentions across the sagas. I used the JavaScript library Sigma to embed the Gephi map into the browser.
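To make the nodes-and-edges step a little more concrete, here is a rough sketch of how coded mentions could be tallied into node weights (mentions per place name) and weighted edges (place name to saga). The rows are invented for illustration, and the place-to-saga edge definition is an assumption, not necessarily how the network above was built.

```python
from collections import Counter

# Invented rows of (place name, saga, usage category), for illustration only.
mentions = [
    ("Bjarg", "Grettis saga", "declarative"),
    ("Bjarg", "Grettis saga", "affiliation"),
    ("Althing", "Grettis saga", "declarative"),
    ("Althing", "Gisla saga", "declarative"),
]


def build_network(rows):
    """Count mentions per place (node weight) and per place-saga pair (edge weight)."""
    node_weights = Counter(place for place, _saga, _usage in rows)
    edge_weights = Counter((place, saga) for place, saga, _usage in rows)
    return node_weights, edge_weights


nodes, edges = build_network(mentions)
for (source, target), weight in edges.items():
    print(source, target, weight)
```

Written out with Source, Target and Weight columns, a list like this is the kind of edge table that network tools such as Gephi can import.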

While I feel that this network offers a greater degree of granularity on uses of place name, right now I feel also that it has two major weaknesses: 1) it does not interact with the geographic map, and 2) I am not sure how well it captures place name’s use within the narrative itself.

My question to you, fellow digitalists: what are ways that I could really demonstrate how place names function within a narrative? Should I account for narrative’s temporal aspect–the fact that time passes as the narrative unfolds, giving a particular shape to the experience of reading that place names might inform geographically? How could I get an overlay, of sorts, on the geospatial map itself? Should I consider topic modelling, text mining? Are there potential positive aspects of this Gephi work that might be worth exploring further?

Submitting to you, dear readers, with enormous debts of gratitude in advance for your help! And even if you don’t consider yourself a literary expert–please chime in. We all read, and that experience of how potentially geographic elements affect us as readers and create meaning through storytelling is my most essential question.

Mapping the Icelandic Outlaw Sagas

Greetings, fellow digitalists,

*warning: long read*

I am so impressed with everyone’s projects! I feel like I blinked on this blog, and all of a sudden everything started happening! I’ve tried to go back and comment on all your work so far–let me know if I’ve missed anything. Once again, truly grateful for your inspiring work.

Now that it’s my turn: I’d like to share a project that I’ve been working on for the past year or so. I’ll break it down into two blog posts–one where I discuss the first part, and the other that requests your assistance for the part I’m still working on.

A year ago, I received funding from Columbia Libraries Digital Centers Internship Program to work in their Digital Social Sciences Center on a digital project of my own choosing. I’ve always gravitated towards the medieval North Atlantic, particularly with anything dark, brooding, and scattered with thorns and eths (these fabulous letters didn’t make it from Old English and Old Norse into our modern alphabet: Þ, þ are thorn, and Ð, ð are eth). Driven by my research interests in the spatiality of imaginative reading environments and their potential lived analogues, I set out to create a map of the Icelandic outlaw sagas that could account for their geospatial and narrative dimensions.

Since you all have been so wonderfully transparent in your documentation of your process, to do so discursively for a moment: the neat little sentence that ends the above paragraph has been almost a year in the making! The process of creating this digital project was messy, and it was a constant quest to revise, clarify, research, and streamline. You can read more about this process here and here and here, to see the gear shifts, epic flubs, and general messiness this project entails.

But, to keep with this theme of documentation as a means of controlling data’s chaotic properties, I’ve decided to thematically break down this blog post into elements of the project’s documentation. Since we’ve already had some excellent posts on Gephi and data visualization, I’ll only briefly cover that part of my project towards the end–look for more details on that part two in another blog post, like I mention above.

As a final, brief preface: some of these sections have been borrowed from my actual codebook that I submitted in completion of this project this past summer, and some parts are from an article draft I’m writing on this topic–but the bulk of what you’ll read below are my class-specific thoughts on data work and my process. You’ll see the section in header font, and the explanation below. Ready?

Introduction to the Dataset

The intention of this project was to collect data on place name in literature in order to visualize and analyze it from a geographic as well as literary perspective. I digitized and encoded three of the Icelandic Sagas, or Íslendingasögur, related to outlaws from the thirteenth and fourteenth centuries, titled Grettis saga (Grettir’s Saga), Gísla saga Súrssonar (Gisli’s Saga), and Hardar Saga og Hölmverja (The Saga of Hord and the People of Holm). I then collected geospatial data through internet sources (rather than fieldwork, although this would be a valuable future component) at the Data Service of the Digital Social Sciences Center of Columbia Libraries, during the timeframe of September 17th, 2013, to June 14th, 2014. Additionally, as part of my documentation of this data set, I had to record all of the hardware, software, and JavaScript libraries I used–this, along with the mention of the date, will allow my research to be reproduced and verified.

Data Sources

Part of the reason I wanted to work with medieval texts is their open-source status; stories from the Íslendingasögur are not under copyright in their original Old Norse or in most 18th and 19th century translations and many are available online. However, since this project’s time span was only a year, I didn’t want to spend time laboriously translating Old Norse when the place names I was extracting from the sagas would be almost identical in translation. With this in mind, I used the most recent and definitive English translations of the sagas to encode place name mentions, and cross-referenced place names with the original Old Norse when searching for their geospatial data (The Complete Sagas of Icelanders, including 49 Tales. ed. Viðar Hreinsson. Reykjavík: Leifur Eiríksson Pub., 1997).

Universe

When I encountered this section of my documentation (not as a data scientist, but as a student of literature), it took me a while to consider what it meant. I’ll be using the concept of “data’s universe,” or the scope of the data sample, as the fulcrum for many of the theoretical questions that have accompanied this project, so prepare yourself to dive into some discipline-specific prose!

On the one hand, the universe of the data is the literary world of the Icelandic Sagas, a body of literature from the 13th and 14th centuries in medieval Iceland. Over the centuries, they have been transmuted from manuscript form, to transcription, to translation in print, and finally to digital documents—the latter of which has been used in this project as sole textual reference. Given the manifold nature of their material textual presence—and indeed, the manuscript variations and variety of textual editions of each saga—we cannot pinpoint the literary universe to a particular stage of materiality, since to privilege one form would exclude another. Seemingly, my data universe would be the imaginative and conceptual world of the sagas as seen in their status as literary works.

A manuscript image from Njáls saga in the Möðruvallabók (AM 132 folio 13r) circa 1350, via Wikipedia

However, this does not account for the geospatial element of this project, or the potential real connections between lived experience in the geographic spaces that the sagas depict. Shouldn’t the data universe accommodate Iceland’s geography, too? The act of treating literary spaces geographically, however, is a little fraught: in this project, I had to negotiate specifically the idea that mapping the sagas is at all possible, from a literary perspective. In the latter half of the twentieth century, scholars considered the sagas as primarily literary texts, full of suspicious monsters and other non-veracities, and from this perspective could not possibly be historical. Thus, the idea of mapping the sagas would have been irrelevant according to this logic, since seemingly factual inconsistencies undermined the historical, and thus geographic, possibilities of the sagas at every interpretive turn.

However, interdisciplinary efforts have been increasingly challenging this dismissive assumption. Everything from studies on pollen that confirm the environmental difficulties described in the sagas, to computational studies that suggest the social patterns represented in the Icelandic sagas are in fact remarkably similar quantitatively to genuine relationships, suggests that the sagas are concerned with the environment and geography that surrounded their literary production.

But even if we can create a map of these sagas, how can we counter the critiques of mapping that it promotes an artificial “flattening” of place, removing the complexity of ideas by stripping them down to geospatial points? Our course text, Hypercities, speaks to this challenge by proposing the creation of “deep maps” that account for temporal, cultural, and emotional dimensions that inform the production of space. I wanted to preserve the idea of the “deep map” in my geospatial project on the Icelandic Sagas, so in a classic etymological DH move (shout out to our founding father Busa, as ever), I attempted to find out more about where the idea of “deep mapping” might have predated Hypercities, which only came out this year yet represents a far earlier concept.

I traced the term “deep mapping” back to William Least Heat-Moon, who coined the phrase in the title of his book, PrairyErth (A Deep Map), to indicate the “detailed describing of place that can only occur in narrative” (Mendelson, Donna. 1999. “‘Transparent Overlay Maps’: Layers of Place Knowledge in Human Geography and Ecocriticism.” Interdisciplinary Literary Studies 1:1. p. 81). According to this definition, “deep maps” occur primarily in narrative, creating depictions of places that may be mapped on a geographic grid that can never truly account for the density of experience that occurs in these sites. Heat-Moon’s use of the phrase, however, does not preclude earlier representations of the concept; the use of narratives that explore particular geographies is as old as the technology of writing. In fact, according to Heat-Moon’s conception of deep mapping, we might consider the medieval Icelandic sagas a deep map in their detailed portrayal of places, landscape, and the environment in post-settlement Iceland. Often occurring around the locus of a few regional farmsteads, the Sagas describe minute aspects of daily Icelandic life, including staking claim to beached whales as driftage rights, traveling to Althing (now Thingvellir) for annual law symposiums, and traversing valleys on horseback to seek out supernatural foes for battle. Adhering to a narrative form not seen again until the rise of the novel in the 18th century, the Íslendingasögur are a unique medieval exemplum of Heat-Moon’s concept of deep mapping and the geographic complexity that results from narrative. Thus, a ‘deep map’ may not only include a narrative, such as in the sagas’ plots, but potentially also a geographic map for the superimposition of knowledge upon it–allowing these layers of meaning to build and generate new meaning.

To tighten the data universe a little more: specifically within the sagas, I have chosen the outlaw-themed sagas for their shared thematic investment in place names and geography. Given that much of the center of Iceland today consists of glaciers and wasteland, outlaws had precious few options for survival once pushed to the margins of their society. Thus, geographic aspects of place name seem to be just as essential to the narrative of sagas as their more literary qualities—such as how they are used in sentences, or what place names are used to obscure or reveal.

Map of Iceland, by Isaac de La Peyrère, Amsterdam, 1715. via Cornell University Library, Icelandica Collection

In many ways, the question of “universe” for my data is the crux of my research question itself: how do we account for the different intersections of universes—both imaginative and literary, as well as geographic and historical—within our unit of analysis?

Unit of Analysis

If we dissect the element that allows geospatial and literary forms to interact, we arrive at place name. Place names are a natural place for this site of tension between literary and geographic place, since they exist in the one shared medium of these two modes of representation: language. In their linguistic as well as geographic connotations, place names function as the main site of connection between geographic and narrative depictions of space, and it is upon this argument that this project uses place name as its unit of analysis.

Methodology

Alright, now that we’re out of the figurative woods, on to the data itself. Here are the steps I used to create a geospatial map with metadata for these saga place names.

Data Collection, Place Names:

The print text was scanned and digitized using ABBYY FineReader 11.0, which performs Optical Character Recognition to ensure PDFs are readable (or “optional character recognition,” as I like to say), and converted to an XML file. I then used the flexible coding format of the XML to hand-encode place name mentions using TEI protocol and a custom schema. In the XML file, names were cleaned from OCR and standardized according to Anglicized spellings to ensure searchability across the data, and for look-up in search engines such as Google–this saved a step in data clean-up once I’d transformed the XML into a CSV.

Here’s the TEI header of the XML document–note that it’s nested tags, just like HTML.

Data Extraction / Cleanup

In order to extract data, the XML document was saved as a CSV. Literally, “File > Save As.” This is a huge benefit of using flexible mark-up like XML, as opposed to annotation software that can be difficult to extract data from, such as NVivo, which I wrote about here on Columbia University Library’s blog in a description of my methodology. In the original raw, uncleaned file, columns represented encoding variables, and rows represented the encoded text. I cleaned the data to eliminate column redundancies and extraneous blank spaces, as well as to preserve the following variables: place name, chapter and saga of place name, type of place name usage, and place name presence in poetry, prose, or speech. I also re-checked my spelling here, too–next time, no hand-encoding!

Here’s the CSV file after I cleaned it up (it was a mess at first!)

I saved individual CSVs, but also kept related info in an Excel document. One sheet, featured here, was a key to all the variables of my columns, so anyone could decipher my data.
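The extraction and clean-up described in this section could also be scripted rather than done through “Save As.” The sketch below is one hedged way to do that with Python’s standard library. The tiny XML sample is invented: `placeName` is a TEI element, but the attribute names (`saga`, `n`, `type`, `mode`), the chapter numbers and the sentences are assumptions for illustration, not the project’s actual schema, and real TEI files usually carry a namespace that this sample omits.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented, simplified sample; not the project's real encoding (assumption).
SAMPLE = """<text saga="Grettis saga">
  <div n="14">
    <p>Grettir went to <placeName type="declarative" mode="prose">Bjarg</placeName>.</p>
  </div>
  <div n="15">
    <p>He rode to the <placeName type="declarative" mode="speech"> Althing </placeName>.</p>
  </div>
</text>"""

FIELDS = ["place_name", "saga", "chapter", "usage", "mode"]


def extract_rows(xml_text):
    """Collect one row per placeName element, trimming stray whitespace."""
    root = ET.fromstring(xml_text)
    saga = root.get("saga", "")
    rows = []
    for chapter in root.iter("div"):
        for place in chapter.iter("placeName"):
            rows.append({
                "place_name": (place.text or "").strip(),
                "saga": saga,
                "chapter": chapter.get("n", ""),
                "usage": place.get("type", ""),
                "mode": place.get("mode", ""),
            })
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = extract_rows(SAMPLE)
print(rows_to_csv(rows))
```

Keeping only the named fields is what removes redundant columns here, and `strip()` takes care of the extraneous blank spaces.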

Resulting Metadata:

Once extracted, I geocoded place names using the open-source software Quantum GIS (QGIS), which is remarkably similar to ArcGIS except FREE, and was able to accommodate some of those special Icelandic characters I discussed earlier. The resulting geospatial file is called a shapefile, and QGIS allows you to link the shapefile (containing your geospatial data) with a CSV (which contains your metadata). This feature allowed me to link my geocoded points to their corresponding metadata (the CSV file that I’d created earlier, which had place name, its respective saga, all that good stuff) with a unique ID number.
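The idea behind linking points to metadata through a unique ID can be pictured in a few lines of Python. This is only a conceptual sketch of a join on a shared key, not how QGIS implements it, and the IDs, coordinates and field names are made up.

```python
def join_on_id(points, metadata):
    """Attach the metadata row that shares each point's ID; skip points with no match."""
    meta_by_id = {row["id"]: row for row in metadata}
    joined = []
    for point in points:
        meta = meta_by_id.get(point["id"])
        if meta is not None:
            joined.append({**point, **meta})
    return joined


# Made-up IDs and coordinates, for illustration only.
points = [
    {"id": 1, "lon": -20.9, "lat": 65.3},
    {"id": 2, "lon": -21.1, "lat": 64.3},
    {"id": 3, "lon": -19.0, "lat": 65.0},
]
metadata = [
    {"id": 1, "place_name": "Bjarg", "saga": "Grettis saga"},
    {"id": 2, "place_name": "Althing", "saga": "Grettis saga"},
]

for row in join_on_id(points, metadata):
    print(row)
```

The point with ID 3 drops out because no metadata row carries that ID, which is why a consistent ID column matters on both sides.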

Data Visualization, or THE BIG REVEAL

While QGIS is a powerful and very accessible software, it’s not the most user-friendly. It takes a little time to learn, and I certainly did not expect everyone who might want to see my data would also want to learn new software! To that end, I used the JavaScript library Leaflet to create an interactive map online. You can check it out here–notice there’s a sidebar that lets you filter information on what type of geographic feature the place name comprises, and pop-ups appear when you click on a place name so you can see how many times it occurs within the three outlaw sagas. Here’s one for country mentions, too.

Click on this image to get to the link and interact with the map.
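Leaflet can draw points supplied as GeoJSON (through its `L.geoJSON` layer), so one plausible way to hand geocoded place names to the browser is to export them in that format. The sketch below is not the project’s actual code: the record fields (`name`, `mentions`, `feature_type`) and the coordinates are assumptions chosen to mirror the sidebar filter and pop-ups described above.

```python
import json


def to_geojson(records):
    """Build a GeoJSON FeatureCollection of points from simple place records."""
    features = []
    for rec in records:
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [rec["lon"], rec["lat"]]},
            "properties": {
                "name": rec["name"],
                "mentions": rec["mentions"],
                "feature_type": rec["feature_type"],
            },
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up record, for illustration only.
records = [
    {"name": "Bjarg", "lon": -20.9, "lat": 65.3, "mentions": 12, "feature_type": "farm"},
]

collection = to_geojson(records)
print(json.dumps(collection, indent=2))
```

On the JavaScript side, the `properties` of each feature are what a pop-up or a filter would read.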

Takeaways

As the process of this documentation highlights, I feel that working with data is most labor-intensive when it comes to positioning the argument you want your data to make. Of course, actually creating the data by encoding texts and geocoding takes a while too, but so much of the labor in data sets in the humanities is intellectual and theoretical. I think this is a huge strength in bringing humanities scholars towards digital methodologies: the techniques that we use to contextualize very complex systems like literature, fashion, history, motherhood, Taylor Swift (trying to get some class shout-outs in here!) all have a LOT to add to the treatment of data in this digital age.

Thank you for taking the time to read this–and please be sure to let me know if you have any questions, or if I can help you in mapping in any way!

In the meantime, stay tuned for another brief blog post, where I’ll solicit your help for the next stage of this project: visualizing the imaginative components of place name as a corollary to this geographic map.

Joy Report – Data Tech [E]mmersion

It’s good to know your strengths.

I’m never going to be a data dude. Thanks to Stephen Real who turned me onto Lynda.com (forwarded from Matt), I watched several tutorials trying to recreate what Micki shared during her workshop on Thursday, Oct. 31st.

But, let me back up a moment. Since acknowledging that I’m probably never going to be a data-dude, it occurs to me that my particular strength is as a communicator. To that end, let me share the last two weeks’ adventures in tech. I have been to EVERY available workshop except the ones on Thursday evenings when I have a previously scheduled class.

This has amounted to six in-person workshops at GC, one FB page, one WordPress site, three online tutorials and an impulsive registration for a Feminist technology course at Barnard (thank you Kelly for referring the info).

Here is what the last month of data-tech-[E]mmersion has looked like:

Tuesday, September 30 – Digital Fellow’s Social Media & Academia: Creating Digital Research Communities Workshop, (Andrew G. MKinney & Laura Kane), Library GC
Friday, October 1 – I wrote a “Twitter” review for the workshop and shared it with my classmates in the DH Praxis 2014 blog site on the Commons.
I also tweaked the Mother Studies webpage on the Commons blog post-workshop
Friday, October 24 – Fellows consultation with Patrick Smyth who showed me Ngram, “Python for kids” workbook, and some other cool things like “Internet Time Machine”, and “Distance Machine.”
Saturday, October 26, blogged about my experience with Patrick, and Ngramed two of my other classes at GC to compare words and texts from a gender perspective; American Studies, and Sociology of Gender.
Art+Feminism Wiki Workshop GC

Monday, October 27 – Wiki Art + Feminism workshop GC – we learned some Wiki code and also found out that only 5% of Wiki contributors are women.
Tuesday, October 28 – WordPress Advanced level users, Library GC. This workshop really helped me see some of the advanced options available to edit my site on the commons. Although these workshops are also frustrating because often we aren’t actually able to try things in the class and it’s tough to remember everything once you get back to your desk. Workshops should have an additional help session, or follow-up lab (or online resource attached to them).
Wednesday, October 29 – Data Mapping for social media, Library GC
Came home that night and built a FB page and blog site called “OurHealthStories.” Thought this might serve as a repository for the big data project and these notes from class. Too much for the DH Blog (Don’t wanna be a “Blog Hog”). I’ve combed through a lot of data sets at this point, and many of them are health related. My own health issues, the state of health care in America today, and recent stories like the one about the creator of the game “Operation” who can’t afford an operation really touched me, and made me want to take action.
Below is a list of the data sites I’ve investigated thus far. I was envisioning a project comparing midwife activity to OBGYN deliveries in America because there is a section of my thesis that would benefit from this. Wrote my advisor.

B-
I have to run a data project for my DH class.
Have you ever, or do you have data on this:
Compare midwife assisted birth to physician assisted birth in US, and data map it.
I want to see the measurable comparisons of how midwives practice relative to doctors. Please let me know if you have anything, also because I want to use it in my thesis paper.

She wrote back and advised against it:
“There is a TON of data on this, and it’s kinda complicated. How are you defining midwife? Nurse-Midwife in-hospital? all midwives? all locations, birth centers, hospitals and homes? How are you controlling for maternal status? Just go take a quick look at the literature and you’ll see. I would not encourage you to include this in the thesis — not in this kind of oversimplistic ‘docs’ vs ‘midwives’ way — as I say, WAY too complicated for that.”

I wrote a friend of mine who is a public health nurse at Hunter.
She wrote back:
“Here are some sources. Is it by state or national data you need to map? Do you know Google Scholar search?
Here’s a link to a report published in 2012 re Midwifery Births
Here’s a link to an article comparing births MD versus CNM
National Vital Statistics Report ***** best resource for raw data
CMS Hospital Compare
https://data.medicare.gov/data/hospital-compare
National Center for Health Statistics Vital Data
http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm
NYC Dept. of Health Data & Statistics http://www.nyc.gov/html/doh/html/data/data.shtml”

Thursday, October 30 – Data Visualization, (Micki Kaufman), Library GC; Impressive project and great demo. Again, I wish we could have actually tried to do some of the things Micki demoed.
Friday, October 31 – Can’t attend the Fellows open hours this week or next week. Wrote Micki to see if she could meet with me at any point during next week for specific questions/answers? Began to export and clean a data set from last year’s academic MOM Conference, thinking it would be interesting to map the geographic locations attendees hailed from.
Saturday, November 1 – Began the day online taking tutorials. Stephen Real and I met before class on Thursday and he suggested a few things after we discussed how we could create a collaborative project. Today I’m watching Lynda.com videos, but for the tutorials that follow up on where Micki left off on Excel documents, I work on a Mac and don’t have a left/right click mouse. So I can’t try a lot of the things they’re demoing. Going to try PDF conversion and scraping now.
Thursday, Nov. 6 – Stephen Real and I met up. He and I “played” with some data cleaning stuff. He told me about his “Great Expectations” project. Sounds cool. Spoke with Chris Vitale, who generously shared some of his tech finds (which people have already been writing about here). Stayed late to talk research ideas with Stephen Brier.
Friday, Nov. 7 – Technology and Pedagogy Certificate Program at the Library. We talked WordPress, plug-ins, and server technology.
Weekend, Nov. 8 – did some research on potential final projects. Explored DH in a Box. I have three ideas. Can’t decide which one to go with. Thinking about creating a survey monkey to ask classmates which idea they like best?

I signed up for “Technologies of Feminism” at Barnard. Starts November 18 and runs for 5 weeks. Here’s what it’s about. Feminism has always been interested in science and technology. Twitter feminists, transgender hormone therapy, and women in STEM are only more recent developments in the long entangled history of tech, science, and gender. And because feminism teaches that technology embodies societal values and that scientific knowledge is culturally situated, it is one of the best intellectual tools for disentangling that history. In this five-week course, we will revisit foundational texts in feminist science studies and contextualize current feminist issues. Hashtag activism and cyberfeminism, feminist coding language and feminized labor, and the eugenic past of reproductive medicine will be among our topics. Readings will include work by Donna Haraway, Maria Fernandez, Lisa Nakamura, Beatriz Preciado and more. Participants of all genders are welcome. No prior knowledge in feminist theory is required.

During the fall 2014 semester, courses similar to this one are taking place across North America in a feminist learning experiment called the Distributed Open Collaborative Course, organized by the international Feminist Technology Network (FemTechNet). As a node in this network, our class will open opportunities for collaboration in online feminist knowledge building—through organizing, content creation, Wikipedia editing, and other means. Together, we will discuss how these technologies might extend the knowledge created in our classroom to audiences and spaces beyond it.

Still haven’t pulled together a comprehensive plan amidst the massive choices available for the data project yet.

WHEW!

I’m en-JOY-ing the journey, but I’m not sure if I can pinpoint a location or product YET. Onward I suppose.

Journaling Data (what else?)

“Day 1”

Challenge number one is finding a dataset I can work with. I want to do something with text. There are plenty of books available digitally, but they always seem to be in formats, such as PDF, that require a fair amount of cleaning (which I would rather not do) before they can be consumed by any of the tools.

I did find a number of interesting datasets on the NYC Open Data website. Many of them are easy to download. I downloaded one small one that contains the average SAT scores for each High School. I was able to bring the dataset into R, too. It’s such a simple dataset–basically just school id plus the scores of each section of the SAT–that there isn’t much analysis one can do (mean, median, standard deviation…?). I would like to combine it with some other datasets that could enrich the data. For example, if I could get the locations of the schools, I could map the SAT data or if I could get demographic data for each school, I could correlate SAT performance to various demographic variables.
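As a small illustration of both ideas (the basic summary statistics, and enriching the scores with a second dataset keyed on school id), here is a sketch in Python rather than R. The school ids, scores and borough labels are invented and do not come from the NYC Open Data file.

```python
import statistics
from collections import defaultdict


def summarize(scores):
    """Return the mean, median and sample standard deviation of a list of scores."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores),
    }


def mean_by_group(scores_by_school, group_by_school):
    """Average the scores of schools that share a group label (for example, a borough)."""
    grouped = defaultdict(list)
    for school_id, score in scores_by_school.items():
        group = group_by_school.get(school_id)
        if group is not None:
            grouped[group].append(score)
    return {group: statistics.mean(vals) for group, vals in grouped.items()}


# Invented school ids, scores and boroughs, for illustration only.
math_scores = {"S1": 400, "S2": 450, "S3": 500}
boroughs = {"S1": "Bronx", "S2": "Bronx", "S3": "Queens"}

print(summarize(list(math_scores.values())))
print(mean_by_group(math_scores, boroughs))
```

The second function is the "combine with another dataset" step in miniature: any table that shares the school id (locations, demographics) could be grouped or compared the same way.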

On a parallel track, I am working with Joy, who is very interested in motherhood and wants to explore datasets related to that subject. If she can find a compelling dataset, we will work together to analyze it.

“Day 2”

So it turns out that Project Gutenberg has books in TXT format that you can download. It also appears that there is a website www.zamzar.com that will convert PDF files to text files for free. Question for further thought: if the PDF is a scanned image, will converting it to text get me anywhere? I doubt it. Best way to find out is to give it a go.

I am going to download Great Expectations from Project Gutenberg and see if I can pull it into R and, perhaps, Gephi or AntConc to see what can be done that might be interesting.
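As a rough idea of a first step once the TXT file is on disk, here is a minimal Python sketch that drops the licence text around the book and counts words. It assumes the file marks the book with lines beginning "*** START OF" and "*** END OF"; if your copy uses different markers, the whole text is counted. The sample text is a made-up placeholder, not the novel.

```python
import re
from collections import Counter

def strip_boilerplate(text):
    """Keep only the lines between the assumed start and end marker lines."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1
        elif line.startswith("*** END OF"):
            end = i
            break
    return "\n".join(lines[start:end])

def word_counts(text):
    """Lowercase the text and count runs of letters and apostrophes."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Placeholder text for illustration only.
sample = """Licence text that should be skipped
*** START OF THE EBOOK (placeholder) ***
Pip met Joe. Joe met Pip again.
*** END OF THE EBOOK (placeholder) ***
More licence text"""

counts = word_counts(strip_boilerplate(sample))
print(counts.most_common(3))
```

For the real book, `sample` would be replaced by the contents of the downloaded file, for example `open("great_expectations.txt", encoding="utf-8").read()`, where the file name is hypothetical.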

“Day 3”

Stay tuned…

There’s a lot of money in Twitter archives. Also, a Data-Driven Look at #gamergate

On September 3rd, #gamergate was the top trending tag on Twitter. This is an impressive feat for several reasons:

1) It was Beyonce’s birthday. #happybirthdaybeyonce was only the SECOND most popular tag.

2) It was in no way related to mainstream media.

3) It’s not a fun hashtag.

The tag #gamergate refers to a discussion happening between game developers, journalists, and enthusiasts following a series of events that Erik Kain concisely explains in GamerGate: A Closer Look At The Controversy Sweeping Video Games.

To state it briefly, an ex-boyfriend of game developer Zoe Quinn wrote a blog post claiming Zoe had slept with members of the press in exchange for positive coverage of her new game, Depression Quest. Following this, Zoe was doxxed (that is, her personal information was maliciously released online), and she began to be harassed. Typical harassment would have been awful enough, but she received several death threats, which is even worse. Several members of the gaming community stood up to defend Zoe, and bad things began to happen to them as well. Phil Fish, developer of indie darling Fez and owner of Polytron, had his Twitter account hacked, and the Polytron website was taken over by hackers. This led Fish to delete his Twitter account and declare that he no longer wished to be associated with the games industry or its fans. Dan Golding explains in his piece “The End of Gamers”:

Campaigns of personal harassment aimed at game developers are nothing new. They are dismayingly common among those who happen to be women, or not white straight men, and doubly so if they also happen to make the sort of game that in any way challenge the status quo, even if that challenge is only made through their very existence. The viciousness and ferocity with which this campaign occurred, however, was shocking, and certainly out of the ordinary. This was something more than routine misogyny (and in games, it often is routine, shockingly). It was an ugly spectacle that should haunt and shame those involved for the rest of their lives.

Several other publications chimed in, including Kotaku, Gamasutra, and Polygon. Later that week, Anita Sarkeesian, host of the gaming vlog Feminist Frequency, posted a piece titled Women as Background Decoration. The threats she received following this were so direct that she was forced from her home.

Despite the large number of publications cited so far, the bulk of the discussion unfolded on Twitter under the #gamergate hashtag. Using TAGS, I have compiled a public archive of several thousand tweets tagged #gamergate here:

Public Archive (Please be patient, it’s a large doc!)

TAGS Explorer: A Visual Representation of the Twitter Conversation (please wait for it to load!)

The conversation was happening so quickly that every time TAGS retrieved tweets from Twitter, it would freeze and become inaccessible for hours as it compiled the archive. That said, these tweets were collected over the course of two weeks, often hundreds at a time, between September 1st and September 10th. Weeks later, #GamerGate is still receiving roughly 100 tweets per minute. I have hit the limit on the size of my archive, but if anyone is well-versed in Google Docs and spreadsheets, I would love some assistance moving tweets into a new sheet so that I can continue compiling tweets. Please contact me ASAP.

Observations:

• Related hashtag #notyourshield appears 1021 times throughout the 4000 tweets archived. #notyourshield is a tag that is supposed to be used by under-represented members of the gaming community, namely women, minorities, and LGBT people, who claim that primary video game media outlets discuss their representation poorly, and often in place of the real issues (such as collusion between game developers and press).

• “SJW” appears 375 times throughout the 4000 tweets archived. SJW is short for “Social Justice Warrior,” a pejorative term for those who defend the rights of under-represented groups online.

• “Journal” (for journalism and journalist) appears 517 times.

• “fem” appears 350 times.

• “Destiny” appears 140 times. Around the time these tweets were compiled, Destiny was on the verge of being released. Many tweets using this tag expressed that the release of Destiny would not slow the discussion circulating around #gamergate. This ended up being true.

• “Quinn” appears 365 times.

• “Phil Fish” appears 45 times.

• “Polygon” appears 475 times. This is the name of a publication that has spoken out against the doxxing of Zoe Quinn and has many well-known and respected female writers on staff. At one point, many were tweeting that Polygon was banning users on its discussion boards for speaking out against Anita Sarkeesian.

• “Kotaku” appears 291 times. Zoe Quinn was alleged to have had relations with a writer who worked for this publication.

• 85% of posters identify as male.

• 58% of posters are from the US.

• 68.1% of posts are made from a desktop computer, 14.2% from an iPhone, and 10.9% from Android devices.

• 64.6% of posts are retweets, 12.5% are replies, and 23% are original posts.
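Counts and percentages like the ones above can be reproduced from an exported archive with a short script. This is only a sketch of one way to do it, not how the numbers in this post were produced. It assumes that each tweet is available as a plain text string, that retweets start with "RT @", and that replies start with "@". The tweets below are made up for illustration.

```python
def count_term(tweets, term):
    """Total case-insensitive occurrences of term across all tweet texts."""
    needle = term.lower()
    return sum(text.lower().count(needle) for text in tweets)

def classify(text):
    # Assumption: retweets start with "RT @" and replies start with "@".
    if text.startswith("RT @"):
        return "retweet"
    if text.startswith("@"):
        return "reply"
    return "original"

def shares(tweets):
    """Percentage of tweets in each category, rounded to one decimal place."""
    totals = {"retweet": 0, "reply": 0, "original": 0}
    for text in tweets:
        totals[classify(text)] += 1
    return {kind: round(100 * n / len(tweets), 1) for kind, n in totals.items()}

# Made-up tweets for illustration only; none come from the archive.
tweets = [
    "RT @someone: Journalism ethics matter #gamergate",
    "@someone I disagree #gamergate #notyourshield",
    "Games journalists respond #gamergate",
    "RT @other: #notyourshield #gamergate",
]

print(count_term(tweets, "journal"))
print(count_term(tweets, "#notyourshield"))
print(shares(tweets))
```

Matching on a stem such as "journal" catches both "journalism" and "journalists", which is the same trick as the “Journal” and “fem” counts in the list. It also means a short stem can pick up unrelated words, so the counts are best read as rough indicators.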

Issues with Archiving:

Using a data tool, I discovered that as of this post over 775,000 tweets have been tagged with #GamerGate. If my 4000 tweets seemed like a large set of data, I am sorry to disappoint. I found a service called HashTracking that would retrieve the full history for me, but it would cost $2.00 for every 1,000 tweets, which comes to about $1,550 for the full set… so… more money than I have. Another service, Keyhole, offers a real-time view of the hashtag across several social media platforms at once. They also offer historical archives, beginning at $49. I am currently waiting on them to send me a quote for my 775k-tweet archive. If it’s not over $200, I will probably suck it up and pay… but I won’t like it.

That said, KEYHOLE IS AMAZING. If you did not click the link to Keyhole above, CLICK THIS NOW. Unfortunately, they only offer a 3-day trial; an actual account starts at $130/mo. You might notice that the link I’ve provided covers dates between August 30th and September 2nd. Because of this, their keyword spread is much different from mine, with the top related tag being #justgamingjournothings.

The biggest problem with archiving this set of tweets is that it’s such a large, unwieldy, and lively set of data. For three weeks it’s been twisting, turning, busy, and relevant. Clearly others have also had difficulty retrieving and dissecting data from Twitter, which is why services like Keyhole and HashTracking exist and charge such high rates. Furthermore, as the #gamergate discussion has been going on for roughly three weeks now, looking at a 4000-tweet snapshot of data may not be good enough. In fact, looking at any dataset smaller than the whole thing might not be good enough. My goal is to capture, compile, and dissect the whole event, and because it has taken weeks to unfold, to capture anything smaller than its absolute form would be an injustice… but who has $1500 to pay for a set of tweets? A big business maybe, not me. Sure, one could work within the limitations of Google Docs, constantly moving the data into new spreadsheets whenever the need arises… but between August 30th and September 2nd, 100,000 tweets were posted. A spreadsheet fills up at 4000, meaning that in 3 days you would need to edit your script 25 times to take new archives into account. There’s no doubt that some data would fall through the cracks, considering that the archival tool would sometimes freeze for hours when updating. Archiving a set of data this large using free, easy-to-use tools would be more than just a full-time job; it would be impossible.
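The back-of-the-envelope numbers in this section are easy to check. The sketch below uses only figures quoted in this post (4,000 tweets per sheet, 100,000 tweets in three days, 775,000 tweets in total, $2.00 per 1,000 tweets) and assumes that a paid service bills for every started block of 1,000 tweets. Dividing 100,000 tweets by a 4,000-tweet sheet gives 25 sheets.

```python
import math

def sheets_needed(total_tweets, tweets_per_sheet):
    """Number of spreadsheets required if each one holds tweets_per_sheet rows."""
    return math.ceil(total_tweets / tweets_per_sheet)

def archive_cost(total_tweets, dollars_per_thousand):
    """Cost in dollars, assuming every started block of 1,000 tweets is billed."""
    return math.ceil(total_tweets / 1000) * dollars_per_thousand

three_day_sheets = sheets_needed(100_000, 4_000)   # tweets from Aug 30 to Sep 2
full_archive_sheets = sheets_needed(775_000, 4_000)
full_archive_cost = archive_cost(775_000, 2.00)

print(three_day_sheets, full_archive_sheets, full_archive_cost)
```

At that rate the full conversation would need close to two hundred separate sheets, which is the practical reason the free route stops being workable.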

What’s Next?

There are certainly more things that I want to try out with TAGS. I can see how it would be useful for tracking a smaller conversation, such as our class tag, #dhpraxis14. I DO plan to find a way to access the entire archive of tweets for #gamergate, as I believe that it might be of some significance later on.

While I do not think this is the place to post my full opinions on the subject, I agree with Dan Golding, who states that the term “gamer” and the connotations behind it are dying. As early as 2010, in A Casual Revolution (which Steven Jones referenced in ch. 1 of his book, which we read this week), Jesper Juul suggested that everyone plays games now, whether on a mobile device or on Facebook. We don’t need a word for “gamer” in the same way that we do not need a word for one who reads books or watches films. Creating this kind of terminology has the potential to create lines of division between those who “do” and those who “do not”. Either way… don’t we need a pristine, complete archive of that? Should I pass around a donation jar?

Digital Praxis Seminar Fall 2014 – Spring 2015