Author Archives: Christopher Vitale

TANDEM Project Update 4.26.15

WEEK 12 TANDEM PROJECT UPDATE:

This has been a week of accelerated achievement on all fronts for TANDEM. Thanks to Steve, we have a working MVP hosted at www.dhtandem.com/tandem. We have also made huge strides on the front end with Kelly’s robust initial set of HTML/CSS pages for the site. While the two ends are not tied together just yet, that is within sight as of this weekend. Jojo continues to surprise the group with her intuitive mix of outreach and awesome, having sent out personalized invitations to key members of our contact list and to people who have shown interest over the past few months. Keep reading for more detailed information about these and other developments.

DEVELOPMENT / DESIGN:

MVP functionality added this week includes:

  • Ability to upload multiple files
  • Ability to persist data via a SQLite database containing project data and pointers to file locations
  • Backend analytic code connected to front end
  • Ability to zip and download results
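
For readers curious what the persistence and download items involve, here is a minimal sketch using only the Python standard library. It is not TANDEM's actual code: the table names (`project`, `upload`), the function names, and the schema are all assumptions made for illustration.

```python
import sqlite3
import zipfile
from pathlib import Path


def init_db(conn):
    """Create one table for projects and one for pointers to uploaded files."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS project (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS upload (
            id INTEGER PRIMARY KEY,
            project_id INTEGER NOT NULL REFERENCES project(id),
            path TEXT NOT NULL
        );
        """
    )


def add_project(conn, name, file_paths):
    """Store a project row plus one row per file location; return the project id."""
    cur = conn.execute("INSERT INTO project (name) VALUES (?)", (name,))
    project_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO upload (project_id, path) VALUES (?, ?)",
        [(project_id, str(p)) for p in file_paths],
    )
    conn.commit()
    return project_id


def project_files(conn, project_id):
    """Return the stored file locations for a project, in insertion order."""
    rows = conn.execute(
        "SELECT path FROM upload WHERE project_id = ? ORDER BY id", (project_id,)
    )
    return [Path(row[0]) for row in rows]


def zip_results(result_paths, archive_path):
    """Bundle result files into a single zip archive for download."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in result_paths:
            # Store only the file name so the archive hides the server's directory layout.
            archive.write(path, arcname=Path(path).name)
    return Path(archive_path)
```

The database holds only pointers (paths) to the files rather than the files themselves, which matches the description above. How the real app names its tables or serves the archive may differ.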

Remaining tasks are:

  • Implement polished UI
  • Implement error handling
  • Handle session management so that simultaneous users keep their data separate
  • Look for opportunities to gain efficiency
  • Correct a small bug in the OpenCV output
  • Review security, backup, and file storage approaches and rework as needed to follow best practices

OUTREACH:

Continuing to garner community support, Jojo attended a GC Digital Initiatives event on Tuesday as well as the English department’s Friday Forum. Additionally, initial invites for the launch went out to the digital fellows and DH Praxis friends and family via Paperless Post. Digital Fellow Ex Officio Micki Kaufman has already replied that she wouldn’t miss it. I’m now working to organize outreach with the other teams.

The press release is coming along on the class wiki, too!!

CORPUS:

With functionality ironed out, we continue to work with the dataset we have generated via TANDEM for the Mother Goose corpus. As part of our release, we will include the analysis and data visualization work we have done on this initial test corpus. If you have questions or points of interest about Mother Goose, feel free to share them in the comments below! We are interested in hearing the kinds of questions one might ask of a text/image corpus.

TANDEM Project Update 4.19.15

PROJECT:

DEVELOPMENT:

Steve continues to power away like some sort of half-man, half-robot, mostly magic developer. As of today, we have successfully incorporated single-file upload functionality into our Django app. Our next action items include:

  • Testing the new upload functionality on the server
  • Implementing multi-file upload and testing it locally and on our server
  • Hooking up the necessary analytics engine to our Django configuration
  • Adding validation and error checking

UI/UX:

With the technical side of the interactivity mapped, we are working on the mockups for the evolution of the front end. We are working through envisioning each step of the process that a user will experience in the TANDEM front end.

We have begun answering:

  • What are the users met with as a landing page?
  • Are there prompts for users ready to upload their files?
  • As the files upload, what kinds of elements will show what is happening in the backend? (Progress bars, spinners, written prompts)
  • Once the files are completed, what are the users met with?
  • How does a table look with our data fed into it for in-browser viewing?
  • What does the download page look like post-processing?
  • Where and how are the downloads delivered?

In a short time, we will be able to show in full color and depth each of the above.

Giving life to those mockups will be the capstone for our main body of work pre-presentation.

OUTREACH:

This week’s outreach centered around a number of events:

  1. Django NYC MeetUp this Wednesday
    • Geoff Sechter continues to be a valuable resource, though his opinion seems to be that jQuery is our best bet for uploading multiple files. He’s been super patient helping explicate the particularities of Python, as well.
    • Peter Karp of Buzzfeed also had interesting ideas and recommended attending the OpenCV Office Hours hosted at Buzzfeed by Andrew Kelleher, Adam Kelleher and Katarina Kufieta. Their next meetup is April 21: http://www.meetup.com/NYC-OpenCV-Study-Group/events/221727855/
  2. Theorizing the Web 2015 at ICP
    • While the many fascinating panels on surveillance did not bear directly on TANDEM, several artists spoke whose work involves text and image, including Claudia Pederson @cc3pc and Nicholas Knouf @zeitkunst, who work on #artforspooks, and Ben Grosser @bengrosser, who created #scaremail. Another interesting talk treated Victorian cartes de visite as early social media.
    • Spoke more with Erin Glass about potential publicity for TANDEM
  3. The Verge NYC after party @Thoughtworks
    • On Tessa’s invitation, Jojo attended the closing party for the workshop week for innovative design
    • Met John Bruce, Assistant Professor of Strategic Design & Management at the New School, who seemed interested in DH overlap
    • Ran into Hannah Lane who does UX at Thoughtworks — a contact point should things get thorny moving forward with TANDEM UI/UX

TANDEM Project Update

PROJECT:

TANDEM 0.5 will be moving from its heavy development phase into a testing and forward-facing design phase this week. At the time of this posting, Steve and Chris are still working out the specifics of functioning unified code, but testing of the independent scripts has begun with some success. Text and image values are easily generated via independent processes.

This week we also discussed the idea of data persistence in some depth. Simply put, would someone be able to access the data they generated at a later date via the TANDEM UI? For this iteration of the software, we agree that this is a valuable component, but not an essential feature for an MVP. That said, we are thinking about both the code needed to run it and the user-specific UI that would accompany such an application.

 

DEVELOPMENT:

We are working away at unifying TANDEM’s independently functioning image and text codebases by aligning the code in a single Python script file. We vetted the idea of having two scripts, one for image and one for text, run simultaneously. The decision was made that for this first iteration of TANDEM, a single .py file will suffice, and in fact may be more maintainable and more efficient.

The code merge has been slow due to Python versioning issues, which led to the code producing different results on different machines.

A call is scheduled for Tuesday with Tim at Reclaim Hosting to work on configuring the server to run Django. Meanwhile, the developer is working through the very thorough Django tutorial and beginning to define appropriate class objects for a potential future version.

 

DESIGN:

Immediately following the code merge, we plan to begin implementing our user interface. A full-size mockup of this is still under version control as we explore new ground with user-specific views and the possibility of in-browser table views of the generated .csv data.

 

OUTREACH:

TANDEM continues to reach new communities. We have a lead, thanks to Sarah Cohn. The Biodiversity Library is currently crowdsourcing their seed catalog archive project, and in advanced versions, TANDEM might improve their information collection. http://blog.biodiversitylibrary.org/2015/03/help-us-improve-access-to-seed-and.html Jojo will contact them once our prototype is more stable.

Additionally, Jojo spoke with Grant Wythoff, who reasserted TANDEM’s relevance to Bill Gleason’s project at the Cotsen Library. Jojo will reach out to Professor Gleason again this week. Grant also recommended we contact Natalie Houston at University of Houston regarding her Digital Victorian project on the visuality of poetry.

Jojo also attended the DjangoGirlsNYC event at Stack Exchange. In addition to familiarizing herself with the framework in which TANDEM will eventually operate, she made useful contact with other Django developers working in NYC.

TANDEM: Team Project Update

TANDEM-related information now has a home on our Commons page.

Technology Notes Week 4 (via Steve)

To build TANDEM, we are using a variety of existing tools:

  1. NLTK: We plan to use NLTK to work with language data after it has been OCR’d by Tesseract. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging and parsing.
  2. FeatureExtractor (or QTip): We need a tool that will analyze the features of the images on each page of a corpus. Both of these tools are developed by the Software Studies Lab run by Lev Manovich. While FeatureExtractor is more powerful, QTip is easier to use. We have run into an issue determining the viability of redistributing FeatureExtractor, as it relies on proprietary software, and we are still working out whether we can use it. In the meantime, we have scheduled a meeting with Lev to discuss our options.
  3. Tesseract: This is an open source OCR tool provided by Google. We will use this tool to identify text elements on each page of the corpus under analysis. The output of Tesseract will be input to NLTK for analysis. While we continue to lean on Tesseract, our outreach has yielded information about Ocular via UC Berkeley. We have reached out to the programmers behind this alternative OCR option and they have been receptive and helpful. We are vetting this as a possible OCR option.

NLTK

Programming

We are currently writing a Python script that will pass a directory of files to NLTK for processing.

Status

NLTK install complete on the developer’s and PM’s machines. Prototype program completed and tested by developer. Prototype program enhanced by PM to take multiple input files and pass them to NLTK for multiple different analyses.

Next Steps

Enhance the script to handle a variable number of input files output by the OCR step.
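
As a rough illustration of that next step, the sketch below walks a directory and processes however many text files it finds. It is an assumption-laden stand-in, not our script: `simple_tokenize` is a plain regex placeholder for wherever the NLTK call would go (so the example runs without NLTK installed), and the `*.txt` pattern assumes the OCR step writes plain text files.

```python
import re
from collections import Counter
from pathlib import Path


def simple_tokenize(text):
    """Placeholder tokenizer (an assumption): lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def analyze_directory(directory, tokenize=simple_tokenize, pattern="*.txt"):
    """Run the tokenizer over every matching file, however many there are.

    Returns a dict mapping file name to a Counter of token frequencies.
    """
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        # OCR output can contain stray bytes, so replace anything undecodable.
        text = path.read_text(encoding="utf-8", errors="replace")
        results[path.name] = Counter(tokenize(text))
    return results
```

Because the tokenizer is passed in as an argument, a real NLTK function could be swapped in later without touching the directory handling.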

______

QTip

Programming

TBD. QTip does not seem to have an API or a way to be launched from a Python script.

Status

Install complete and tested.

Next Steps

Determine if there is a way to utilize QTip from a Python program. A meeting is scheduled with Lev Manovich on March 4 to discuss this.

______

Tesseract

Programming

A prototype module has been created and tested by the developer to process a variable number of PNG files through the Tesseract OCR engine.

Status

Install complete on the developer’s computer. A prototype program has been tested that verifies the viability of using Tesseract with PNG files. A significant challenge will be to find methods to improve the poor quality of the OCR output.

Next Steps

We will focus on improving the quality of the output and on testing the OCR engine with other types of input files (TIF, JPEG, GIF and BMP are obvious candidates). We also continue to explore alternative OCR engines that may work better than Tesseract.
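
To make the idea concrete, here is a hedged sketch of feeding a variable number of image files to the Tesseract command-line tool from Python. It is not the developer's prototype module. It assumes the `tesseract` executable is on the PATH and uses its basic `tesseract <image> <outputbase>` form; the function names and the list of suffixes are illustrative, and whether a given Tesseract build reads each format is exactly what remains to be tested.

```python
import subprocess
from pathlib import Path

# Candidate input types named above; support depends on the Tesseract build
# and its image libraries, so treat this set as an assumption to verify.
IMAGE_SUFFIXES = {".png", ".tif", ".tiff", ".jpg", ".jpeg", ".gif", ".bmp"}


def find_images(directory):
    """Return every candidate image in the directory, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def tesseract_command(image_path, output_dir):
    """Build the CLI call `tesseract <image> <outputbase>`.

    Tesseract appends `.txt` to the output base itself.
    """
    output_base = Path(output_dir) / Path(image_path).stem
    return ["tesseract", str(image_path), str(output_base)]


def ocr_directory(image_dir, output_dir, run=subprocess.run):
    """OCR each image in turn; return the text files that should be produced."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    outputs = []
    for image in find_images(image_dir):
        command = tesseract_command(image, output_dir)
        run(command, check=True)  # raises CalledProcessError if Tesseract fails
        outputs.append(Path(command[2] + ".txt"))
    return outputs
```

One caveat with this sketch: two images that share a name but differ in extension (say `page1.png` and `page1.tif`) would write to the same output file, so a real version would need a naming rule for that case.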

______

On the outreach front:

  • We continue to start conversations via the #picturebookshare and #tandem hashtags on Twitter.
  • Emails are currently being exchanged on the OCR and image feature extraction front to determine what best practices we can take advantage of.

______

On the design front:

Because we are focusing mainly on our backend functionality, all forward-facing design work remains in the outreach department. We are working on playful picture-book-related logos and marketing materials as well as maintaining a presence in the illustration community. We are not yet at the stage of developing designs for the user interface, but we are beginning to consider the types of functionality (buttons, sliders, etc.) that we will incorporate in the final tool.

Suggested Resource – Full Stack Python

In our roaming around the internet, we discovered Full Stack Python written by Matt Makai of Twilio. Take a look at the TOC below for more specific info. Matt does a thorough job of documenting interesting and helpful resources and breaks down more complicated processes into smaller tasks.

He is also very responsive on Twitter @mattmakai.

TIL Dropbox and BitTorrent both employ Python in their workflows.

Table of Contents

Every topic below with a link currently has a page on Full Stack Python. If there isn’t a link I’m working on getting a page for that topic up.

TANDEM (0.5)

Let’s build a GUI that combines the power of Google’s Tesseract OCR and FeatureExtractor.

The idea is to build an environment (web-based or standalone) where you can take your text-overlaid object, toss it in, and have a save-ready output file to take away with you. Generate the data you need to visualize, explore, parse apart, and build the story. There is an interesting dialogue between text and images happening in comics, children’s picture books, marketing materials, illustrated maps, illuminated manuscripts, etc. Get your data, understand the output variables in simple and easy-to-reference ways, and get back to finding your story.

  • Team Note: Be ready to learn. Everyone involved in this intimidating project will bleed through their role and engage in collaborative learning of each of the elements needed to complete the project. Developers must understand design, designers must understand branding, outreach/branding must understand how the thing works, and the project coordinator must understand how to get conversations rolling to hit deadlines.
  • Developer: Bravery. Develop clean and clear code that will allow us to wrap our OCR and image processing software as modules to be placed in the overall software. A working understanding of code and a willingness to dedicate time to digging into what needs to be written to get this off the ground. Knowledge of Python, or of a single language at the very least.
  • Designer: Understand user interaction and develop aesthetically simple, intuitive interface. Understand design basics, have a working proficiency in Adobe Design programs. Also maintain brand identity in conjunction with Outreach Coordinator.
  • Outreach Coordinator: Social butterflies. We need community support. Work on creating a voice and an audience for this project. Using not only social media but having the ability to track where our message is working best. What tweets work, what outlets are giving good feedback. We need to make a communal conversation that helps us reach our goal.
  • Project Manager: Keep your hand on the pulse of the schedule, set deadlines, gather learning resources, keep open lines of communication between team members. I have so many people in mind for this and each one can potentially bring an entirely different outcome to the project. I want a project manager who wants to see this thing materialize.

Reassurance – Let’s say we build it in Python and JavaScript – Here are some key pieces we can consider:

Google’s Tesseract OCR

Python-tesseract is a Python wrapper for Google’s Tesseract-OCR

FeatureExtractor (Let’s talk to Lev about this. It is one of his tools after all.)

PyJamas GUI Toolkit

#skillset Chris V (@CVDH4)

Here are some key skillsets I can bring to each role.

  • Project Management: Trello and Google Apps are my best friends. Task management, team building, and communication are the guts of what I would bring as project manager. I have professional working experience in the other three positions and know how to create interdepartmental dialogues (namely getting Dev Operations to play nicely with Ad Sales Reps, Editorial, and Marketing.)
  • Developer: This interests me a great deal. I have a basic understanding of Python, JavaScript, and PHP. I am more proficient in HTML/CSS than anything (tip of the hat to the days of building custom MySpace pages). This does, however, seem like the role in which I will learn the most. I want more hands-on development experience. It is a stop-learning-and-start-creating mentality that drives me here.
  • Design/UX: This is probably my strongest area of experience. I work as a graphic designer and have been involved in many web/mobile design projects. I have a working proficiency with most things made by Adobe including Illustrator, InDesign, Adobe DPS, After Effects, Edge Animate, Edge Code, Dreamweaver, Muse, and Inspect CC. If I were to take on the designer role, I would focus my research on better understanding user interaction, prototyping, and front end development.
  • Outreach: Brand is everything. (I fear working in marketing has ruined me for life.)

I currently work at Queens College in the Center for Teaching and Learning as a program assistant, as well as a Digital Fellow for the Writing Across the Curriculum Department. That means I have a pool of learning resources that we can tap into, a place to have meetings, and a full media lab at our disposal.

And so it continues…

Taking a scan of where we as a group began and where we stand today, I am enamored with the skills that are developing around me. From Mary Catherine’s awe-inspiring visualization of Icelandic Sagas to Martha Joy’s splintering proposal ideas, this group has evolved into a community of valuable thinkers, but more importantly valuable workers.

While I work through my own project proposal, I find more and more areas where I will need help executing each stage of development. I should be discouraged that the staff and the skillsets required for the success of my project are only expanding as I think it through. Instead, I am excited to consider not only closer friends in the course as assets, but also people with whom I have yet to really chat one on one as potential teammates.

When NYPL went around the room asking us about what we were working on and what we were going to propose, I have to admit, I went with a lame cop-out answer. I hadn’t had the heart to blurt out what I really was thinking of proposing. Instead I went with some idea about a content series or something of that sort.

I have gone with a much more exciting proposition. It involves not only the study of an unexplored corpus, but also the development of a new platform for studying a particular type of media. I will explain more in my presentation tomorrow, but I thought this would be a good time and outlet to reflect on our group, our growth, and our future together.

Thank you all for being such a fantastic, collaborative, and thought-provoking amalgamation of personalities, minds, backgrounds, and insight.

Tay Sway by the Numbers

DOES A POP STAR’S LEXICON WAX OR WANE WITH FAME?

What happens when you juxtapose the lyrics of Taylor’s self-titled debut album from 2006 with those from her album “1989”, the chart topping, million-copies-in-a-week latest album?

This is an extremely (valiant attempt at an) academic exploration of Taylor Swift’s first and latest albums.

A Quick Overview: The lyrics were pulled from AZ Lyrics. The raw text files were cleaned using the free text editor TextWrangler for Mac. All punctuation, extra spacing, and special characters were removed. As a basic entry point to NLP, I have employed Voyant-Tools.org, the web-based reading and analysis environment for digital texts, to give some numeric values to pieces of the text. Best of all, it’s all compiled on its very own Commons site.
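
For anyone who would rather script the cleanup than do it by hand, here is a small Python sketch of the same idea. It is not what I actually ran (the cleaning was done in TextWrangler and the numbers came from Voyant). The function names are made up for illustration, and lowercasing the text is my own assumption so that "Shake" and "shake" count as one word.

```python
import re
from collections import Counter


def clean_lyrics(raw):
    """Lowercase, keep only letters, digits and spaces, and collapse whitespace."""
    text = raw.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)    # punctuation and special characters
    return re.sub(r"\s+", " ", text).strip()   # extra spacing and line breaks


def lexicon_stats(cleaned):
    """Return total words, unique words, and unique/total for a cleaned text."""
    words = cleaned.split()
    counts = Counter(words)
    total = len(words)
    unique = len(counts)
    ratio = unique / total if total else 0.0
    return {"total": total, "unique": unique, "ratio": ratio}
```

Running `lexicon_stats(clean_lyrics(text))` on each album's lyrics gives a quick way to compare vocabulary size against length. Note that stripping apostrophes turns "don't" into "dont", which is fine for counting but worth remembering when reading word lists.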

I analyze, visualize, explore, document, and set free the Tay Sway Corpus here:

https://taysway.commons.gc.cuny.edu/

All the data has been made live so you too can play with Taylor Swift lyrics in an academic setting.

[Image: Taylor Swift Speak Now - Pittsburgh]