Team Members and Roles
- Stephen Real – Developer
- Kelly Blanchat – Designer/UI/UX
- Jojo Karlin – Outreach Coordinator
- Christopher Vitale – Project Manager
TANDEM 0.5 is a standalone desktop application that generates image and text metadata from files (.jpeg, .gif, .tiff, .pdf, etc.) submitted by the user. This output is intended as input for data visualization, quantitative analysis, and distant reading of multimodal print objects. TANDEM will combine existing open-source technologies, including a version of Tesseract OCR, FeatureExtractor, and lightweight natural language processing packages, to generate useful output. The output will be concatenated into a single document that can be saved in .txt or .csv format.
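As a sketch of what the concatenated output might look like, the snippet below writes combined per-page text and image metadata to CSV with Python's standard library. The field names (`word_count`, `brightness`, `hue`) and filenames are illustrative assumptions only; the final TANDEM schema is not yet fixed.

```python
import csv
import io

# Hypothetical per-page records: an OCR-derived text statistic alongside
# example image features. These fields are placeholders, not TANDEM's schema.
pages = [
    {"file": "page_001.tiff", "word_count": 42, "brightness": 0.71, "hue": 0.33},
    {"file": "page_002.tiff", "word_count": 17, "brightness": 0.64, "hue": 0.29},
]

# Concatenate all records into a single CSV document in memory.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["file", "word_count", "brightness", "hue"])
writer.writeheader()
writer.writerows(pages)

csv_text = buffer.getvalue()
print(csv_text)
```

A real run would replace the hard-coded dictionaries with values returned by the OCR and feature-extraction steps, and write to a user-chosen .csv file instead of an in-memory buffer.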
To explore the functionality of TANDEM, we will employ a test corpus of public-domain picture books acquired from either HathiTrust or Project Gutenberg. The test corpus will demonstrate that TANDEM streamlines the generation of the kinds of data needed to make informed distant readings of multimodal print artifacts. TANDEM's intended audience is scholars with a range of computational expertise and a need for quantitative insight into picture books, comics, illuminated manuscripts, and other images with overlaid text.
Very brief environmental scan
What problem does this solve?
In the expanding field of digital humanities scholarship and popular data visualization, demand for data is rapidly increasing. A number of tools are available that extract data from images and that digitize and retrieve information from non-digital text. As the possibilities of data visualization gain currency, the types of applications expand. Our visual culture calls for an extractor that can approach the visual elements of text.
What lacuna does it fill?
TANDEM brings two approaches together in one place. While many open-source tools respond to the call for manipulable data from text and from images, there is no single tool for analyzing texts that heavily involve both (picture books, advertisements, illuminated manuscripts). TANDEM aims to resolve the disconnect between the study of these two print elements by culling usable text and image data simultaneously. TANDEM also proposes to develop a graphical user interface (GUI) that makes data processing more accessible to humanists and lay people interested in examining the properties of words and pictures.
What other projects are there?
Currently, there are a number of tools dedicated to text and image extraction. TAPoR, at the University of Alberta, serves as a gateway to tools for sophisticated text analysis and retrieval. Lev Manovich's ImagePlot is a free software tool that visualizes collections of images and video of any size; Manovich's Time magazine covers project shows the possibilities for visualization with this kind of image data. MathWorks has developed feature extraction tools that represent parts of an image as a compact feature vector. For optical character recognition (OCR), there are open-source options such as Google's Tesseract OCR, as well as commercial programs (Adobe Acrobat, Evernote, OmniPage, ABBYY FineReader, Canon's Readiris, and Prizmo). TANDEM will merge the basic functions of these sorts of tools to provide a unified data compiler. It will offer pipelined access to open-source data visualization output options along the lines of Voyant, Wordle, or Gephi, either within modules or as click-through functions from the standalone application.
What technologies will be used?
Our project will be a desktop GUI application built in Python. We will leverage APIs to utilize Tesseract (OCR processing) and FeatureExtractor (image processing) to produce the TANDEM output, which will take the form of a CSV download, a simple database structure, or both. Elements of the UI will be designed in the Adobe Creative Suite. The project will be promoted via Twitter and a Commons site.
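For the "simple database structure" option, a minimal sketch using Python's standard-library sqlite3 module is shown below. The table and column names are assumptions for illustration, not a settled schema.

```python
import sqlite3

# In-memory database for illustration; a real run would write a .db file.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE pages (
           filename    TEXT PRIMARY KEY,
           ocr_text    TEXT,     -- raw text from the OCR step
           word_count  INTEGER,  -- light NLP summary statistic
           brightness  REAL      -- example image feature
       )"""
)
conn.execute(
    "INSERT INTO pages VALUES (?, ?, ?, ?)",
    ("page_001.tiff", "Once upon a time...", 4, 0.71),
)
conn.commit()

row = conn.execute("SELECT word_count, brightness FROM pages").fetchone()
print(row)
```

Because sqlite3 ships with Python and stores everything in a single file, it would keep the database option as dependency-free as the CSV option.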
Which of these are known?
Chris and Stephen have working Python experience, while Kelly and Jojo are novice-level programmers. To date, the majority of the TANDEM team has not used Tesseract or FeatureExtractor in their work; Chris has lightly tested FeatureExtractor in the lead-up to this semester's work. Chris has a professional working proficiency in the Adobe Creative Suite and design for the Web. Kelly has expertise in the Adobe Creative Suite as well, and a working understanding of designing for the web. The entire TANDEM team is familiar with Twitter and the Commons platform for outreach.
Which need to be learned?
Although Chris and Stephen understand Python syntax, the whole TANDEM team will be conducting hands-on learning of various packages, such as Pyjamas for GUI building. To further the group's Python education, we will utilize existing resources within the Graduate Center, such as the Digital Fellows' workshops and the library's access to Lynda, and we will reach out to Lev Manovich. Additional information will be obtained from the Python community on Twitter and Stack Overflow.
What’s the plan to learn them? What support is needed?
The plan to learn them is multi-faceted. Python has a rich universe of tutorials, books, and web documentation, which we have already begun to use. Tesseract and FeatureExtractor are intended for use by non-programmers and also provide user guides, FAQs, and other documentation, here: https://code.google.com/p/softwarestudies/wiki/FeatureExtractor and here: https://code.google.com/p/tesseract-ocr/.
The team plans to consult the Digital Fellows during office hours and/or by appointment. As with most technologies, the learning curve is difficult to predict; there may be points when we get stuck and need more support than the Digital Fellows can provide.
How will the project be managed?
We have organized our workflow on the Trello task-management platform. The larger research points, development steps, and deliverables are broken into tickets for each member of the team. We are employing Google Drive as a place to organize the docs and sheets containing important information that we are tracking. GitHub will help us maintain a version history of the codebase. Team members' roles will overlap, but each will maintain ownership of an individual subset of the project.
Milestones (including dates of deliverables)
| Milestone | Date |
| --- | --- |
| Application Dependencies Defined and Compiled | 2/24 |
| Required User Interface Elements Defined and Designed | 3/10 |
| Functional Specifications Validated | 3/17 |
| Text Metadata Functionality | 3/31 |
| Image Metadata Functionality | 4/7 |
| Combined Text/Image Functionality | 4/14 |
| Quality Assurance Testing/Debugging | 5/5 |
| Build Package to Deploy TANDEM | 5/12 |
| Test Corpus Analysis via TANDEM-Generated Data | 5/19 |