TANDEM-related information now has a home on our Commons page.
Technology Notes Week 4 (via Steve)
To build Tandem, we are utilizing a variety of existing tools. The tools are:
- NLTK: We plan to use NLTK to work with language data after it has been OCR’d by Tesseract. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging and parsing.
- FeatureExtractor (or QTip): We need a tool that will analyze the features of the images on each page of a corpus. Both of these tools are developed by the Software Studies Lab, run by Lev Manovich. While FeatureExtractor is more powerful, QTip is easier to use. We have run into an issue determining the viability of redistributing FeatureExtractor, as it relies on proprietary software. We are still working out whether we can use it given that dependency. In the meantime, we have scheduled a meeting with Lev to discuss our options.
- Tesseract: This is an open source OCR tool provided by Google. We will use this tool to identify text elements on each page of the corpus under analysis. The output of Tesseract will be input to NLTK for analysis. While we continue to lean on Tesseract, our outreach has yielded information about Ocular via UC Berkeley. We have reached out to the programmers behind this alternative OCR option and they have been receptive and helpful. We are vetting this as a possible OCR option.
We are currently writing a Python script that will pass a directory of files to NLTK for processing.
NLTK install complete on the developer’s and PM’s machines. Prototype program completed and tested by the developer. Prototype program enhanced by the PM to take multiple input files and pass them to NLTK for multiple different analyses.
Enhance the script to handle a variable number of input files that are output from the OCR step.
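As a rough illustration of the script described above, the sketch below reads every text file in a directory and runs each one through several named analyses. The function names (`load_ocr_texts`, `nltk_analyses`, `run_analyses`), the `.txt` extension, and the two example analyses are assumptions made for illustration, not the project's actual code. NLTK is imported only when its analyses are requested, so the file handling can be tried on its own.

```python
from pathlib import Path


def load_ocr_texts(directory, pattern="*.txt"):
    """Return {filename: text} for every file in `directory` matching `pattern`."""
    texts = {}
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            # OCR output is often messy; replace undecodable bytes instead of failing.
            texts[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return texts


def nltk_analyses():
    """Build a few example analyses on top of NLTK (hypothetical choices)."""
    from nltk import FreqDist
    from nltk.tokenize import wordpunct_tokenize  # regex-based tokenizer

    def words(text):
        # Keep alphabetic tokens only, lowercased.
        return [t.lower() for t in wordpunct_tokenize(text) if t.isalpha()]

    return {
        "word_count": lambda text: len(words(text)),
        "top_words": lambda text: FreqDist(words(text)).most_common(10),
    }


def run_analyses(texts, analyses):
    """Apply every analysis to every text.

    Returns {filename: {analysis_name: result}}.
    """
    return {
        name: {label: func(text) for label, func in analyses.items()}
        for name, text in texts.items()
    }
```

With NLTK installed, `run_analyses(load_ocr_texts("ocr_output"), nltk_analyses())` would produce one small report per file, where `ocr_output` is a placeholder directory name. Because the analyses are passed in as a dictionary of functions, new ones can be added without touching the file handling.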
TBD. QTip does not seem to have an API or a way to be launched from a Python script.
Install complete and tested.
Determine if there is a way to utilize QTip from a Python program. A meeting is scheduled with Lev Manovich on March 4 to discuss this.
A prototype module has been created and tested by the developer to process a variable number of PNG files through the Tesseract OCR engine.
Install complete on the developer’s computer. A prototype program has been tested that verifies the viability of using Tesseract with PNG files. A significant challenge will be to find methods to improve the poor quality of the OCR output.
We will focus on improving the quality of the output and on testing the OCR engine with other types of input files (TIF, JPEG, GIF and BMP are obvious candidates). We also continue to explore alternative OCR engines that may work better than Tesseract.
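The OCR step described above could be sketched along the following lines: collect every image in a directory, whatever its type, and pass each one to the `tesseract` command-line tool. This is a minimal sketch, not the prototype module itself. The function names and the extension list are assumptions, the `tesseract` executable is assumed to be installed and on the PATH, and whether a given format can actually be read may depend on how the local Tesseract was built.

```python
import subprocess
from pathlib import Path

# Candidate input types from the notes; matching is case-insensitive.
IMAGE_EXTENSIONS = {".png", ".tif", ".tiff", ".jpg", ".jpeg", ".gif", ".bmp"}


def find_images(directory):
    """Return the image files in `directory`, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def ocr_image(image_path):
    """Run the `tesseract` command-line tool on one image and return its text.

    Assumes `tesseract` is installed; passing "stdout" as the output base
    makes it print the recognized text instead of writing a file.
    """
    completed = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return completed.stdout


def ocr_directory(directory, ocr=ocr_image):
    """OCR every image in `directory`; returns {filename: recognized_text}.

    `ocr` can be swapped for another engine's wrapper to compare results.
    """
    return {path.name: ocr(path) for path in find_images(directory)}
```

Keeping the OCR call behind a replaceable `ocr` argument is one way to try an alternative engine on the same set of pages without changing the rest of the script.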
On the outreach front:
- We continue to start conversations via the #picturebookshare and #tandem hashtags on Twitter.
- Emails are currently being exchanged on the OCR and image feature extraction front to determine what best practices we can take advantage of.
On the design front:
Because we are focusing mainly on our backend functionality, all forward-facing design work remains in the outreach department. We are working on playful picture-book-related logos and marketing materials as well as maintaining a presence in the illustration community. We are not yet at the stage of developing designs for the user interface, but we are beginning to consider the types of functionality (buttons, sliders, etc.) that we will incorporate in the final tool.