Tag Archives: Data

“Playing” with Images

I was very taken with Lev Manovich’s article on image visualization, “How to Compare One Million Images?”, which deals with ImagePlot and its use in his project, although at the time I wasn’t thinking of using it for the dataset “play” project. I am a visually driven person, and spend quite a bit of time playing around with images. Just as others relax with books, I curl up with images and spend a lot of time gazing at pictures, and an almost equal amount of time searching for them. So, with my new-found awareness of data, I began wondering: could my preferences be quantified, and could the resulting measures be used as search criteria?

So with the dataset project in mind, I went back to Manovich’s article and read it again for details, which directed me to the Software Studies website to download ImageJ. I then downloaded the macro, ImagePlot, required for image visualization. After installing it in ImageJ, I set about finding its requirements for visualization in the software documentation. All that ImagePlot required was an image collection with associated metadata. I put together a set of 135 images from my personal collection after sifting through 600-odd images. I took particular care to include only those that I really liked, so the results would be meaningful.

As ImagePlot automatically scales the images to a uniform size, it was enough to just pull all the pictures together into a single folder. (The ImagePlot documentation does mention that such a step is not required, as it is capable of handling images stored at different locations on a computer.)

Now that I had the image set in place, I went back to the documentation to find out what format was required of the metadata, which happened to be ‘delimited tab text’. At first, assuming the metadata had to be manually assembled, I spent some time creating a trial file for 20 images in that format. Once it became apparent this would be time-consuming, I went back to the documentation and learned that ImageJ does ‘batch’ image processing and measuring (measuring multiple images in one step), the results of which are stored as a .csv file by default. Just choose the features that are to be measured (image brightness, gray values, etc.), click on ‘measure’ and, in one stroke, metadata appropriate for the image visualization is created by ImageJ itself! Overjoyed and very appreciative of ImageJ, I proceeded to convert this .csv file to the ‘delimited tab’ .txt format in Excel and was finally all set to go.
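For anyone who would rather script the conversion step described above than do it in Excel, here is a minimal sketch in Python. It is only an illustration of turning comma-separated text into tab-delimited text; the column names and values in the sample are made up for the example and are not the actual ImageJ output.

```python
import csv
import io


def csv_to_tab_delimited(csv_text: str) -> str:
    """Convert comma-separated text into tab-delimited text."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    # Write the same rows back out, using a tab as the field separator.
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# Hypothetical sample: these column names and numbers are invented
# for illustration, not taken from a real measurement file.
sample = "filename,mean_gray\nimg001.jpg,112.4\nimg002.jpg,87.9\n"
print(csv_to_tab_delimited(sample))
```

To convert a real file, the same function could be fed the contents of the .csv and its result written to a .txt file.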

Snapshot of Metadata in Delimited Tab .txt format

I chose to measure mean gray value (y-axis) and intensity (x-axis) of the images and plotted the values with the following results.

Through the visualization, I was able to see the range of gray values and intensity my images possessed. It seems I prefer images that are bright with less grayness, and of moderate intensity. Most of the images are of medium to low gray values, with very few in the high gray and high intensity category. The lines link images of similar characteristics and show how the images relate to each other.

As a next step, I intend to pursue animated visualizations now that I’m familiar with the visualization process. The biggest revelation for me was the documentation that accompanied the software. I’d always assumed that answers had to be found elsewhere, from knowledgeable users, but most of my questions were answered by the documentation itself. The worked-out sample projects that accompanied the software were helpful as well. These resources gave me the confidence to approach the project and fix errors in processing. Understanding data formats and creating metadata for the images were equally empowering.

So, going back to my earlier question – can my image preferences be quantified? Yes. But I have yet to figure out how to use these values as search criteria for image collections. That is where I go from here.

ImagePlot and ImagePlot Documentation can be found here –

http://lab.softwarestudies.com/p/imageplot.html

https://docs.google.com/document/d/1zkeik0v2LJmi1TOK4OxT7dVKJO7oCmx_fNP8SYdTG-U/edit?hl=en_US&pli=1#

Journaling Data (what else?)

“Day 1”

Challenge number one is finding a dataset I can work with. I want to do something with text. There are plenty of books available digitally, but they always seem to be in formats, such as PDF, that require a fair amount of cleaning (which I would rather not do) before they can be consumed by any of the tools.

I did find a number of interesting datasets on the NYC Open Data website. Many of them are easy to download. I downloaded one small one that contains the average SAT scores for each high school. I was able to bring the dataset into R, too. It’s such a simple dataset–basically just school id plus the scores of each section of the SAT–that there isn’t much analysis one can do (mean, median, standard deviation…?). I would like to combine it with some other datasets that could enrich the data. For example, if I could get the locations of the schools, I could map the SAT data, or if I could get demographic data for each school, I could correlate SAT performance to various demographic variables.
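As a rough illustration of the basic statistics mentioned above (mean, median, standard deviation), here is a small sketch. It is written in Python rather than R, and the scores are invented placeholders, not values from the NYC Open Data file.

```python
import statistics

# Hypothetical section scores for five schools; these numbers are
# made up for illustration and are not from the real dataset.
scores = [400, 450, 500, 550, 600]

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    # Sample standard deviation (divides by n - 1).
    "stdev": statistics.stdev(scores),
}

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

With a real download, the list would be replaced by one column of the dataset, one run per SAT section.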

On a parallel track, I am working with Joy, who is very interested in motherhood and wants to explore datasets related to that subject. If she can find a compelling dataset, we will work together to analyze it.

“Day 2”

So it turns out that Project Gutenberg has books in TXT format that you can download. It also appears that there is a website www.zamzar.com that will convert PDF files to text files for free. Question for further thought: if the PDF is a scanned image, will converting it to text get me anywhere? I doubt it. Best way to find out is to give it a go.

I am going to download Great Expectations from Project Gutenberg and see if I can pull it into R and, perhaps, Gephi or Antconc to see what can be done that might be interesting.

“Day 3”

Stay tuned…

Mapping Data: Workshop 3/3

Hi all,

Just to follow up on Mary Catherine’s post about finding data, I wanted to recap the final session of this workshop series that took place tonight.

The library guide on mapping data (by Margaret Smith) can be found here: http://libguides.gc.cuny.edu/mappingdata

As in the other two workshops, Smith emphasized thinking about who would be keeping this data and why as a part of the critical research process. It’s especially interesting given the size of these data sets and maps, meaning that the person (or corporate entity, NGO, or government agency) likely has a very specific reason for hosting this information.

She brought us through a few examples, from basic mapping sites like the NYT’s “Mapping America”, which draws on 2005-2009 Census Bureau data, to basic mapping applications like Social Explorer (the free edition has limited access, but the GC has bought full access) and the USGS and NASA mapping applications. The guide also includes a few more advanced mapping options, like ArcGIS, but the tool that seemed most useful to me, in the short term anyway, is Google’s Fusion Tables, which allows you to merge data sets that have terms in common. The example Smith used was a data set of demographic data (percentage of minority students) organized by town name (towns in Connecticut) and a second set of data that defined geographic boundaries for the same set of towns. Fusion Tables then lets you map the demographic data and select various ways to visualize and customize your results.
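The idea of merging two data sets on a term they have in common can be sketched in a few lines of Python. This is not how Fusion Tables works internally, just an illustration of the concept; the town names, percentages, and boundary values are placeholders invented for the example.

```python
def merge_on_key(left, right, key):
    """Join two lists of records, keeping rows whose key appears in both."""
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Combine the columns from both records into one row.
            merged.append({**row, **match})
    return merged


# Hypothetical data: names and values are invented for illustration.
demographics = [
    {"town": "Town A", "pct_minority_students": 12.0},
    {"town": "Town B", "pct_minority_students": 48.5},
    {"town": "Town C", "pct_minority_students": 30.1},
]
boundaries = [
    {"town": "Town A", "boundary": "polygon-A"},
    {"town": "Town B", "boundary": "polygon-B"},
]

mappable = merge_on_key(demographics, boundaries, "town")
for row in mappable:
    print(row)
```

Note that "Town C" drops out because it has no boundary record, which is a reminder that the shared terms have to match exactly for a merge like this to work.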

My main takeaway from this series was that each of these tools is highly particular and unique, and you have to really dig into playing with the individual system before you’ll even know if it is the right tool for your work.

That, and also learn R.

Finding Data: Preliminary Questions

Hello, all,

As promised, here’s a link to the “Finding Data” library guide on the Mina Rees Library site. Apologies if someone has posted it already!

http://libguides.gc.cuny.edu/findingdata

The guide was created by the wonderful Margaret Smith, an adjunct librarian at the GC Library who is teaching the workshops on data for social research. There’s one more–Wednesday, 6:30-8:30pm downstairs in the library in one of the computer labs–and I’m sure she’d be happy to have anyone swing by. Check out the Library’s blog for details.

Within this guide, the starting questions that Smith provides, in order to get you thinking of your dataset theoretically as well as practically, are very helpful–and I wish I had them years ago! Here are some highlights, taken directly from the guide (but you should really click through!):

HOW TO FIND DATA:

When searching for data, ask yourself these questions…

Who has an interest in collecting this data?

If federal/state/local agencies or non-governmental organizations, try locating their website and looking for a section on research or data.
If social science researchers, try searching ICPSR.

What literature has been written that might reference this data?

Search a library database or Google Scholar to find articles that may have used the data you’re looking for. Then, consult their bibliographies for the specific name of the data set and who collected it.

HOW TO CONSIDER USING IT:

Is the data…

From a reliable source? Who collected it and how?
Available to the public? Will I need to request permission to use it? Are there any terms of use? How do I cite the data?
In a format I can use for analysis or mapping? Will it require any file conversion or editing before I can use it?
Comparable to other data I’m using (if any)? What is the unit of analysis? What is the time scale and geography? Will I need to recode any variables?

And another thought that I really loved from her first workshop in this series:

Consider data as an argument.

Since data is social, what factors go into its production? What questions does the data ask? And how do the answers to these questions, as well as the questions above, affect the ways in which that dataset can shed light on your research questions?

All fantastic stuff–looking forward to seeing more of these data inquiries as they pop up on the blog!

(again, all bulleted text is from the “Finding Data” Lib Guide, by Meg Smith, Last Updated Oct. 15 2014. http://libguides.gc.cuny.edu/findingdata)

Datasets

Here is the link to New York City’s open data: https://data.cityofnewyork.us

Data vs. Capta

Hello, all,

Here is Johanna Drucker’s piece, “Humanities Approaches to Graphical Display,” that I mentioned in class.

In her abstract, Drucker argues that “the concept of data as a given has to be rethought through a humanistic lens and characterized as capta, taken and constructed.” In doing so, I feel that she helps to establish what is unique about humanities computing–that it is not computer science and humanistic study on different sides of the same coin, but rather an integration of concepts from both disciplines.

http://www.digitalhumanities.org/dhq/vol/5/1/000091/000091.html

And so it begins.

In May of 2013, I graduated from Queens College. I spent a small fortune, pennies compared to most, to receive a piece of paper that gave validity to my ramblings about how cool I thought Chaucer was/is. I got a degree in English. I must be crazy. Shortly after, I got a job at a magazine. The problem was it wasn’t in the editorial department.

That fall, I began working as a Digital Media Planner at a decent-sized publishing company. The reality of digital publication soon came to light. It was my responsibility to develop online advertising strategies for blue-chip brands looking to hit wealthy middle-aged men and women or intelligent millennial thought-starters across our sites.

What is the exact frequency and quantity of annoying flashing advertisements we can throw at users before they stop coming back?

Moreover, how much money can we make off the millions of branded pictures, animations, and videos we position next to our content? We are only shooting for a click-through rate of 0.05% (about 5 out of every 10,000 ads).

Needless to say, in a few short months I began to feel anxious about the amount of information ad servers can gather on users’ online habits. While shopping for data management platforms for our sites, we heard promises of user profile optimization that would create content and ad experiences specific to a particular person. My web is different than your web.

Omnichannel personalization. Behavioral data. Interest profiles. Purchase histories.

Suddenly, I was hyper aware of how fabricated the thin veneer of the web really was. Most of us interact with a variety of publications on a daily basis, hitting a top tier of social networks along the way, and reading highly curated content that caters to our need to digest quickly and move on.

What I also realized was the power of data behind this scheme. There are entire industries pivoted on gathering, sourcing, organizing, analyzing, and visualizing these enormous pools of information.

It is much easier to sell a brand an ad campaign when they know their target demographic is exactly who is going to see it.

I began falling in love with data. Call it a complex, but it was essentially a game to see if I could make these half-million-dollar campaigns work. I spent hours analyzing user flows, traffic rates, and article statistics.

Visualizations began to tell stories. Charts and infographics were as nuanced as poetry.

I hated the job, but I loved the data.

I come to digital humanities with this love of data and degree in literature. Independently rooted, I hope to unite these two spheres and find common ground this year.

And so it begins.

Digital Praxis Seminar Fall 2014 – Spring 2015
