Tag Archives: Data

Journaling Data (what else?)

“Day 1”

Challenge number one is finding a dataset I can work with. I want to do something with text. There are plenty of books available digitally, but they always seem to be in formats, such as PDF, that require a fair amount of cleaning (which I would rather not do) before they can be consumed by any of the tools.

I did find a number of interesting datasets on the NYC Open Data website. Many of them are easy to download. I downloaded one small one that contains the average SAT scores for each high school. I was able to bring the dataset into R, too. It’s such a simple dataset–basically just school ID plus the scores for each section of the SAT–that there isn’t much analysis one can do (mean, median, standard deviation…?). I would like to combine it with some other datasets that could enrich the data. For example, if I could get the locations of the schools, I could map the SAT data, or if I could get demographic data for each school, I could correlate SAT performance with various demographic variables.
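
For illustration, here is a minimal sketch of the kind of summary statistics mentioned above. It is written in Python rather than R, and the column names and scores are made-up placeholders, not values from the actual NYC Open Data file.

```python
import csv
import io
import statistics

# Made-up placeholder rows; the real file will have its own column names.
SAMPLE_CSV = """school_id,reading,math,writing
S001,400,420,390
S002,450,470,440
S003,500,520,490
"""


def summarize(csv_text, column):
    """Return mean, median, and sample standard deviation for one score column."""
    rows = csv.DictReader(io.StringIO(csv_text))
    # Skip rows where the score is blank.
    values = [float(row[column]) for row in rows if row[column].strip()]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


math_summary = summarize(SAMPLE_CSV, "math")
print(math_summary)
```

With a real download, the text of the CSV file would take the place of `SAMPLE_CSV`, and the column name would need to match whatever the file actually uses.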

On a parallel track, I am working with Joy, who is very interested in motherhood and wants to explore datasets related to that subject. If she can find a compelling dataset, we will work together to analyze it.


“Day 2”

So it turns out that Project Gutenberg has books in TXT format that you can download. It also appears that there is a website www.zamzar.com that will convert PDF files to text files for free. Question for further thought: if the PDF is a scanned image, will converting it to text get me anywhere? I doubt it. Best way to find out is to give it a go.

I am going to download Great Expectations from Project Gutenberg and see if I can pull it into R and, perhaps, Gephi or AntConc to see what can be done that might be interesting.
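
One simple first thing to try with a plain-text book is a word-frequency count. The sketch below is in Python rather than R, and it uses a short stand-in sentence instead of the novel. The filename in the comment is hypothetical.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Stand-in text. For a real run, read the downloaded file instead, e.g.
# text = open("great_expectations.txt", encoding="utf-8").read()  # hypothetical filename
sample = "The cat sat on the mat. The mat was flat."
counts = word_counts(sample)
print(counts.most_common(3))
```

Project Gutenberg files include license text at the beginning and end, so that would probably need to be trimmed before the counts mean much.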

“Day 3”

Stay tuned…

Mapping Data: Workshop 3/3

Hi all,

Just to follow up on Mary Catherine’s post about finding data, I wanted to recap the final session of this workshop series that took place tonight.

The library guide on mapping data (by Margaret Smith) can be found here: http://libguides.gc.cuny.edu/mappingdata

As in the other two workshops, Smith emphasized thinking about who would be keeping this data and why as a part of the critical research process. It’s especially interesting given the size of these data sets and maps: whoever hosts them (a person, corporate entity, NGO, or government agency) likely has a very specific reason for doing so.

She brought us through a few examples, from basic mapping sites like the NYT’s “Mapping America,” which draws on 2005-2009 Census Bureau data, to basic mapping applications like Social Explorer (the free edition has limited access, but the GC has bought full access) and the USGS and NASA mapping applications. The guide also includes a few more advanced mapping options, like ArcGIS. The tool that seemed most useful to me, in the short term anyway, is Google’s Fusion Tables, which allows you to merge data sets that have a column of values in common. The example Smith used was a data set of demographic data (percentage of minority students) organized by town name (towns in Connecticut) and a second data set that defined geographic boundaries for the same set of towns. Fusion Tables then lets you map the demographic data and select various ways to visualize and customize your results.
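
The idea behind that kind of merge can be sketched in a few lines of Python. This is not Fusion Tables’ own interface, just an illustration of joining two tables on a shared town name. The towns, percentages, and boundary labels are all placeholders.

```python
# Two small tables keyed by town name. All values are placeholders.
demographics = {
    "Town A": {"pct_minority_students": 12.5},
    "Town B": {"pct_minority_students": 40.0},
    "Town C": {"pct_minority_students": 27.3},
}
boundaries = {
    "Town A": {"boundary": "polygon-A"},
    "Town B": {"boundary": "polygon-B"},
    "Town D": {"boundary": "polygon-D"},
}


def merge_on_key(left, right):
    """Inner join: keep only keys found in both tables and combine their fields."""
    return {key: {**left[key], **right[key]} for key in left if key in right}


merged = merge_on_key(demographics, boundaries)

# Towns that appear in only one table, worth checking for spelling differences.
unmatched = sorted(set(demographics) ^ set(boundaries))

print(merged)
print(unmatched)
```

Listing the unmatched names is a useful habit, since a merge like this silently drops any town whose name is spelled differently in the two sources.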

My main takeaway from this series was that each of these tools is highly particular and unique, and you have to really dig into playing with the individual system before you’ll even know if it is the right tool for your work.

That, and also learn R.

Finding Data: Preliminary Questions

Hello, all,

As promised, here’s a link to the “Finding Data” library guide on the Mina Rees Library site: http://libguides.gc.cuny.edu/findingdata. Apologies if someone has posted it already!


The guide was created by the wonderful Margaret Smith, an adjunct librarian at the GC Library who is teaching the workshops on data for social research. There’s one more–Wednesday, 6:30-8:30pm downstairs in the library in one of the computer labs–and I’m sure she’d be happy to have anyone swing by. Check out the Library’s blog for details.

Within this guide, the starting questions that Smith provides, in order to get you thinking of your dataset theoretically as well as practically, are very helpful–and I wish I had them years ago! Here are some highlights, taken directly from the guide (but you should really click through!):


When searching for data, ask yourself these questions…

Who has an interest in collecting this data?

  • If federal/state/local agencies or non-governmental organizations, try locating their website and looking for a section on research or data.
  • If social science researchers, try searching ICPSR.

What literature has been written that might reference this data?

  • Search a library database or Google Scholar to find articles that may have used the data you’re looking for. Then, consult their bibliographies for the specific name of the data set and who collected it.


Is the data…

  • From a reliable source? Who collected it and how?
  • Available to the public? Will I need to request permission to use it? Are there any terms of use? How do I cite the data?
  • In a format I can use for analysis or mapping? Will it require any file conversion or editing before I can use it?
  • Comparable to other data I’m using (if any)? What is the unit of analysis? What is the time scale and geography? Will I need to recode any variables?

And another thought that I really loved from her first workshop in this series:

Consider data as an argument.

Since data is social, what factors go into its production? What questions does the data ask? And how do the answers to these questions, as well as the questions above, affect the ways in which that dataset can shed light on your research questions?

All fantastic stuff–looking forward to seeing more of these data inquiries as they pop up on the blog!

(Again, all bulleted text is from the “Finding Data” Lib Guide, by Meg Smith, last updated Oct. 15, 2014. http://libguides.gc.cuny.edu/findingdata)


Data vs. Capta

Hello, all,

Here is Johanna Drucker’s piece, “Humanities Approaches to Graphical Display,” that I mentioned in class.

In her abstract, Drucker argues that “the concept of data as a given has to be rethought through a humanistic lens and characterized as capta, taken and constructed.” In doing so, I feel that she helps to establish what is unique about humanities computing–that it is not computer science and humanistic study on different sides of the same coin, but rather an integration of concepts from both disciplines.


And so it begins.

In May of 2013, I graduated from Queens College. I spent a small fortune, pennies compared to most, to receive a piece of paper that gave validity to my ramblings about how cool I thought Chaucer was/is. I got a degree in English. I must be crazy. Shortly after, I got a job at a magazine. The problem was it wasn’t in the editorial department.

That fall, I began working as a Digital Media Planner at a decent-sized publishing company. The reality of digital publication soon came to light. It was my responsibility to develop online advertising strategies for blue-chip brands looking to hit wealthy middle-aged men and women or intelligent millennial thought-starters across our sites.

What is the exact frequency and quantity of annoying flashing advertisements we can throw at users before they stop coming back?

Moreover, how much money can we make off the millions of branded pictures, animations, and videos we position next to our content? We are only shooting for a click-through rate of 0.05% (about 5 out of every 10,000 ads).

Needless to say, in a few short months I began to feel anxious about the amount of information ad servers can gather on users’ online habits. While shopping for data management platforms for our sites, we heard promises of user profile optimization that would create content and ad experiences specific to a particular person. My web is different than your web.

Omnichannel personalization. Behavioral data. Interest profiles. Purchase histories.

Suddenly, I was hyperaware of how fabricated the thin veneer of the web really was. Most of us interact with a variety of publications on a daily basis, hitting a top tier of social networks along the way, and reading highly curated content that caters to our need to digest quickly and move on.

What I also realized was the power of data behind this scheme. There are entire industries pivoted on gathering, sourcing, organizing, analyzing, and visualizing these enormous pools of information.

It is much easier to sell a brand an ad campaign when they know their target demographic is exactly who is going to see it.

I began falling in love with data. Call it a complex, but it was essentially a game to see if I could make these half-million-dollar campaigns work. I spent hours analyzing user flows, traffic rates, and article statistics.

Visualizations began to tell stories. Charts and infographics were as nuanced as poetry.

I hated the job, but I loved the data.

I come to digital humanities with this love of data and degree in literature. Independently rooted, I hope to unite these two spheres and find common ground this year.

And so it begins.