TANDEM PROJECT PLAN

Team Members and Roles

Stephen Real – Developer
Kelly Blanchat – Designer/UI/UX
Jojo Karlin – Outreach Coordinator
Christopher Vitale – Project Manager

Abstract

TANDEM 0.5 is a standalone desktop environment that generates image and text metadata from files (.jpeg, .gif, .tiff, .pdf, etc.) submitted by the user. This output is intended to be used as input for data visualization, quantitative analysis, and distant reading of multimodal print objects. TANDEM will combine existing open source technologies, including a version of Tesseract OCR, FeatureExtractor, and light natural language processing packages, to generate this output. The output will be concatenated into a single document that can be saved in .TXT or .CSV format.
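As a rough illustration of the single concatenated output described above, the sketch below merges per-page text metadata and image metadata into one CSV document using only the Python standard library. The field names (page, word_count, mean_brightness) and the sample values are hypothetical placeholders, not TANDEM's actual schema.

```python
import csv
import io


def merge_page_metadata(text_rows, image_rows):
    """Join text and image metadata on the shared "page" key."""
    image_by_page = {row["page"]: row for row in image_rows}
    merged = []
    for text_row in text_rows:
        combined = dict(text_row)
        # Pages with no image metadata keep only their text fields.
        combined.update(image_by_page.get(text_row["page"], {}))
        merged.append(combined)
    return merged


def to_csv(rows, fieldnames):
    """Serialize merged rows to CSV text; missing fields are left empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample values, for illustration only.
text_rows = [
    {"page": "p001.tiff", "word_count": 42},
    {"page": "p002.tiff", "word_count": 7},
]
image_rows = [{"page": "p001.tiff", "mean_brightness": 0.81}]

csv_text = to_csv(
    merge_page_metadata(text_rows, image_rows),
    ["page", "word_count", "mean_brightness"],
)
print(csv_text)
```

Keying both kinds of metadata on the page file name is one possible design; it keeps the text and image steps independent until the final merge.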

To explore the functionality of TANDEM, we will employ a test corpus of Public Domain picture books, acquired from either HathiTrust or Project Gutenberg. The test corpus will illustrate that TANDEM streamlines the ability to generate the kinds of data needed to make informed distant readings of multimodal print artifacts. TANDEM has an intended audience of scholars with a range of computational expertise and a need for quantitative insight into picture books, comics, illuminated manuscripts, and other images with overlaid text.

Very brief environmental scan

What problem does this solve?

In the expanding field of digital humanities scholarship and popular data visualization, demand for data is rapidly increasing. A number of tools are available that extract data from images and that digitize and retrieve information from non-digital text. As the possibilities of data visualization gain currency, the types of applications expand. Our visual culture calls for an extractor that can approach the visual elements of text.

What lacuna does it fill?

TANDEM brings two approaches together in one place. While many open source tools respond to the call for manipulable data from text and images, there is no single place for analyzing texts that heavily involve both (picture books, advertisements, illuminated manuscripts). TANDEM aims to resolve the disconnect between the study of these two print elements by culling usable text and image data simultaneously. TANDEM also proposes to develop a graphical user interface (GUI) that makes data processing more accessible to humanists and laypeople interested in examining the properties of words and pictures.

What other projects are there?

Currently, there are a number of tools dedicated to extracting and analyzing data from text and images. TAPoR, at the University of Alberta, serves as a gateway to the tools used in sophisticated text analysis and retrieval. Lev Manovich’s ImagePlot is a free software tool that visualizes collections of images and video of any size. Manovich’s Time Magazine project shows the possibilities for visualizations with this image data. MathWorks has developed Feature Extraction to represent parts of an image as a compact feature vector. For optical character recognition (OCR), there are both open source options, such as Google’s Tesseract OCR, and proprietary programs, including Adobe Acrobat, Evernote, OmniPage, ABBYY FineReader, Canon’s Readiris, and Prizmo. TANDEM will merge the basic functions of these sorts of tools to provide a unified data compiler. It will offer pipelined access to open source data visualization output options along the lines of Voyant, Wordle, or Gephi, either within modules or as click-to functions from the standalone application.
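To make the idea of a compact feature vector concrete, here is a minimal sketch that reduces an image's pixels to two summary numbers, mean saturation and mean brightness. It is an illustration only and does not reproduce how MathWorks' tools or FeatureExtractor compute their features; the function name and the three sample pixels are hypothetical, and a real pipeline would read pixels from an image file rather than a hand-written list.

```python
import colorsys
from statistics import mean


def feature_vector(pixels):
    """Reduce (r, g, b) pixels in the 0-255 range to two summary features."""
    # Convert each pixel to hue/saturation/value, all scaled to 0.0-1.0.
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    return {
        "mean_saturation": mean(s for _, s, _ in hsv),
        "mean_brightness": mean(v for _, _, v in hsv),
    }


# Hypothetical three-pixel "image": white, black, and pure red.
sample_pixels = [(255, 255, 255), (0, 0, 0), (255, 0, 0)]
features = feature_vector(sample_pixels)
print(features)
```

However large the image, the result is the same small set of numbers, which is what makes feature vectors convenient as rows in a spreadsheet or CSV file.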

What technologies will be used?

Our project will be a desktop GUI application built in Python. We will use APIs to call Tesseract (OCR processing) and FeatureExtractor (image processing) to produce the TANDEM output, which will be in the form of a CSV download, a simple database structure, or both. Elements of the UI will be designed in Adobe Creative Suite. The project will be promoted via Twitter and a Commons site.
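One way a Python application could call Tesseract is through its command-line interface, where `tesseract <image> <outputbase>` writes the recognized text to `<outputbase>.txt`. The sketch below assumes Tesseract is installed separately and available on the system path; the function names and the file name page_001.tiff are hypothetical, and this is not necessarily the integration TANDEM will use.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base):
    """Build the CLI call; Tesseract writes its text to <output_base>.txt."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_page(image_path, output_base):
    """Run Tesseract if it is installed and return the text, else None."""
    if shutil.which("tesseract") is None:
        return None  # Tesseract is not available on this machine.
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# Hypothetical file name, for illustration only; nothing is executed here.
command = build_tesseract_command("page_001.tiff", "page_001")
print(command)
```

Separating the command construction from the call that runs it makes the step easy to check on a machine where Tesseract is not installed.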

Which of these are known?

Chris and Stephen have working Python experience, while Kelly and Jojo are novice-level programmers. To date, the majority of the TANDEM team has not used Tesseract or FeatureExtractor in their work. Chris has lightly tested FeatureExtractor in the lead-up to this semester’s possible work. Chris has professional working proficiency in Adobe Creative Suite software and design for the web. Kelly has expertise in Adobe Creative Suite as well, and a working understanding of designing for the web. The entire TANDEM team is familiar with Twitter and the Commons platform for outreach.

Which need to be learned?

Although Chris and Stephen understand Python syntax, the whole TANDEM team will be learning various packages hands-on, such as Pyjamas for GUI building. To further the group's Python education, we will use existing resources within the Graduate Center, such as the Digital Fellows’ workshops and access to Lynda via the library, and we will reach out to Lev Manovich. Additional information will be obtained from the Python community on Twitter and Stack Overflow.

What’s the plan to learn them? What support is needed?

The plan to learn them is multi-faceted. Python has a rich universe of tutorials, books, and web documentation, which we have already begun to use. Tesseract and FeatureExtractor are intended for use by non-programmers and also provide user guides, FAQs, and other documentation, here: https://code.google.com/p/softwarestudies/wiki/FeatureExtractor and here: https://code.google.com/p/tesseract-ocr/.

The team plans to leverage the Digital Fellows during office hours and/or by appointment. As with most technologies, the learning curve is difficult to predict. It may be that there will be points in time when we get stuck and will need more support than the Digital Fellows can provide.

How will the project be managed?

We have organized our workflow in Trello, a task management platform. The larger research points, development steps, and deliverables are broken into tickets for each member of the team. We are using Google Drive to organize the docs and sheets containing the information we are tracking. GitHub will help us maintain a version history of the codebase. Team members will share work across roles while each maintains ownership of their own part of the project.

Milestones (including dates of deliverables)

Deliverable	Due Date
Application Dependencies Defined and Compiled	2/24
Required User Interface Elements Defined and Designed	3/10
Functional Specifications Validated	3/17
Text Metadata Functionality	3/31
Image Metadata Functionality	4/7
Combined Text/Image Functionality	4/14
Code/GUI Completion	4/21
Quality Assurance Testing/Debugging	5/5
Build Package to Deploy TANDEM	5/12
Test Corpus Analysis via TANDEM generated data	5/19

2 thoughts on “TANDEM PROJECT PLAN”

Luke Waltzer (he/him) February 16, 2015 at 2:00 pm

Impressive work. I’m excited for this to get more specific, and am curious what questions you’ll be asking about the relationships between images and text. Again: identifying a test corpus needs to be a priority. I also think you need to speak with Lev sooner than later, as he will be able to tell you more than I can about the accuracy of your environmental scan.

Why desktop rather than hosted?

Amanda Hickman February 16, 2015 at 4:49 pm

Folks,
I’m glad to see all of these projects coming together. Please keep in mind that an abstract is abstract. It isn’t the place for specifics about what kind of file types you plan to accommodate or produce. You’ve got the phrase “distant reading of multimodal print artifacts” (or objects) on repeat but I’d like more clarity about what you are doing. When we ask “what lacuna does this fill” we don’t mean, what gaps between existing software, but what needs. Who needs this? Why? What are they going to do with it? What is hard that Tandem will make easy? “People want more data” isn’t a problem that you should be putting energy into solving, and I don’t think that’s really what you see as the problem you’re solving. I think it is filler.

I also want to caution you against getting too focused on a desktop GUI. Maybe this is something that lives on server, runs in a browser. Leave that possibility open.

This is a great project and I’m looking forward to seeing it come together!


Digital Praxis Seminar Fall 2014 – Spring 2015