Digital HUAC: MVP Post

Over the course of this project, and in response to the feedback we’ve been receiving, we have scaled our goals and expectations up and down. It has been both humbling and empowering to consider what we can do within the constraints of a single-semester project. When asked to brainstorm our minimum viable product (MVP) this week, we agreed over a conference call on the following:

– a central repository with basic search functionality that stores our corpus of 5 transcripts.

– a database that can be scaled.

What does this mean, and how does it differ from our current project goals?

We are attempting to build a platform that connects a relational database to a robust search interface and provides an API so that users can extract data. We envision Digital HUAC as the start of a broader effort to organize HUAC transcripts and give researchers and educators access to their every character. By supporting advanced searches driven by keywords and categories, we aim to let users drill down into the text of the transcripts.

Our MVP focuses on storing the transcripts in a digital environment that returns simple search results: in the absence of a robust search mechanism, users would instead receive results indicating, for example, that a sought-after term appears in a given transcript, and not much more.

Our MVP must lay out a model for a scalable database. We are still very much figuring out exactly how our database will operate, so it is hard to fully commit to what even a pared-down version of this would look like. But we know that the MVP version must work with plain text files as the input and searchable files as the output.
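To make that concrete, here is a minimal sketch of what “plain text in, searchable records out” could look like. We have not committed to a DBMS yet, so SQLite (from the Python standard library) stands in below, and the table and column names are purely illustrative, not our final schema.

```python
# Minimal sketch (not our final schema): load plain-text transcripts into a
# relational table and run the simple "does this term appear?" search of the MVP.
import sqlite3
from pathlib import Path

conn = sqlite3.connect("huac_mvp.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS transcripts (
        id INTEGER PRIMARY KEY,
        witness TEXT,          -- person testifying (illustrative column)
        hearing_date TEXT,     -- ISO date string for now
        body TEXT              -- full plain text of the transcript
    )
""")

# Input: a folder of plain-text transcript files.
for path in Path("transcripts").glob("*.txt"):
    conn.execute(
        "INSERT INTO transcripts (witness, hearing_date, body) VALUES (?, ?, ?)",
        (path.stem, None, path.read_text(encoding="utf-8")),
    )
conn.commit()

# Output: which transcripts contain a term -- "and not much more."
term = "contempt"
rows = conn.execute(
    "SELECT witness FROM transcripts WHERE body LIKE ?", (f"%{term}%",)
).fetchall()
print(f"'{term}' appears in {len(rows)} transcripts:", [r[0] for r in rows])
```

Scaling this up would mean swapping SQLite for a production DBMS and adding the category fields from our controlled vocabulary, but the input/output contract stays the same.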

Generating an MVP has been a useful thought experiment. It has forced us to home in on the twin narrative and technical theses of this project: essentially, if everything else were stripped away, what must be left standing? For us, this means providing basic search results and a working model of a relational database that, given appropriate time and resources, could be expanded to accommodate a much greater corpus.

TANDEM Project Update Week 5


Excitement!

Team TANDEM is working fast and furiously on all fronts. We’ve hit a few snags but all told, we feel like we’ve got a handhold on the mountains we’re climbing. Here’s a brief overview of the ups and downs of the week:

  • Our hope that we might springboard off Lev’s tool proved to be something of a castle in the air. Lev’s feature extractor was coded in a day, and when his lab tried to run it again later, they couldn’t. Lev suggested we use OpenCV instead.
  • OpenCV seems to be a massive and constantly shifting morass of dependencies.
  • Jojo attended a couple of talks by Franco Moretti and spoke with him afterwards to see if anyone at Stanford was doing anything similar. While he acknowledged the validity of studying text as image, he showed no further interest. Bummer, but his loss.

The Details

In development we’ve got a working program for OCR and NLTK (Go Steve!), and we’re making strides in OpenCV (Go Chris!). Lev suggested that we include two different types of picture books in our test corpus — one that’s rich in color, another that’s more gray-scale and text-heavy. These corpus variations will show the range of data values available to future users of TANDEM. Kelly is working on scanning an initial test corpus now.
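To give a sense of what those “strides in OpenCV” might eventually yield, here is a rough sketch of the kind of per-page image statistics the tool could report. The specific metrics below (brightness, saturation, edge density) are placeholders of our own choosing, not TANDEM’s final feature set.

```python
# Sketch only: the kind of per-page image statistics OpenCV could give us.
import cv2
import numpy as np

def page_image_stats(path):
    img = cv2.imread(path)                      # BGR image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

    brightness = float(np.mean(gray))           # 0 (black) .. 255 (white)
    saturation = float(np.mean(hsv[:, :, 1]))   # low for gray-scale pages
    edges = cv2.Canny(gray, 100, 200)
    edge_density = float(np.count_nonzero(edges)) / edges.size

    return {"brightness": brightness,
            "saturation": saturation,
            "edge_density": edge_density}

# A color-rich page should score far higher on saturation than a gray-scale,
# text-heavy page -- exactly the contrast the two test corpora are meant to show.
print(page_image_stats("scans/page_001.png"))
```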


We also have our hosting set up with Reclaim thanks to Tim, as well as email forwarding for our domain. Go ahead and send us an email at [email protected].

In design/UI/UX, Kelly has been working on variations of a brand identity: color schemes, logo, web design elements… all of it (Go Kelly!). The UI is primed to go now that we have hosting for TANDEM. Kelly is currently identifying the code for the specific UI elements we would want in an “ideal world” situation. The next steps for design/UI/UX are to pick a final brand image and apply it to all our outreach initiatives.

In outreach, TANDEM had a good meeting with Lev on Wednesday. He seems to think we’re doing something other DHers aren’t quite doing. We’re not yet convinced that it’s not just because it’s crazy hard. Either way, we’re up for it. Otherwise, Jojo has been working hard (Go Jojo!) on all outreach fronts. This week we received interest from Dr. Bill Gleason at the Cotsen Children’s Library at Princeton, where they’re working on ABC book digitization and seem especially interested in our image analysis. This response is proof of relevance in the field.

As for social media, we now have a proper Twitter handle, which we will admit happened in the middle of last week’s class, thanks to some pressure from Digital HUAC already having one. You can follow us @dhTANDEM. More on Twitter: TANDEM had a couple of really useful retweets (hurray Alex Gil, massively connected Columbia DHer!) that generated some traffic on our website (Jetpack has us at 138 views so far, which is not a ton, but it’s a start!) and won us some good DH followers — @NYCDH, @trameproject. We’ve transferred #picturebookshare to the @dhTANDEM account and are inviting our followers to participate, as well as to use it to suggest additional items for our test corpus.

THE MINIMUM VIABLE PRODUCT (MVP)

[Image: TANDEM wireframe]

MVP version #1

Because TANDEM is leveraging tools that already exist, one very basic minimum deliverable is that TANDEM makes OCR, NLTK, and OpenCV easy to use. After all, if TANDEM itself is not easy to use, there is no inherent advantage in using TANDEM over simply installing the existing tools and running them.

At this minimum deliverable, TANDEM would bring these tools together in a web-based environment, relieving users of the headache of installing the component elements. Even after installing those elements, a user would likely have to write code to obtain the required output; TANDEM will shield the user from that need to be a programmer. At this minimum deliverable, however, TANDEM would not yet wrap the three together into a single output.

MVP version #2

A second, more advanced minimum viable product would be to have a website on which a person could upload high resolution TIFF files, press a “run TANDEM” button, and receive a .CSV document containing the core (minimum) output.

The minimum output will consist of six NLTK values (average word length, word count, unique word count, word frequency excluding stop words, bi-grams, and tri-grams) and three image statistics for each input page provided by the user. We hope to expand the range of file types we can support and to improve the quality of our OCR output, as well as build more elaborate modules for feature detection in both text and illustrations. However, we contend that demonstrating the comparative values of a couple of corpora of picture books will prove that there is relevant information to be found across corpora with heavy image content.
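As a concrete illustration, here is a minimal sketch of how the six per-page NLTK values listed above could be computed. The tokenizer and stop-word choices are assumptions on our part, not TANDEM’s final settings.

```python
# Sketch of the six per-page NLTK values named above.
import nltk  # one-time setup: nltk.download("punkt"); nltk.download("stopwords")
from nltk import word_tokenize, bigrams, trigrams
from nltk.corpus import stopwords
from collections import Counter

def page_text_stats(ocr_text):
    tokens = [t.lower() for t in word_tokenize(ocr_text) if t.isalpha()]
    stops = set(stopwords.words("english"))
    content = [t for t in tokens if t not in stops]

    return {
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens) if tokens else 0,
        "word_count": len(tokens),
        "unique_word_count": len(set(tokens)),
        "word_frequency": Counter(content).most_common(10),  # excluding stop words
        "bigrams": list(bigrams(tokens))[:10],
        "trigrams": list(trigrams(tokens))[:10],
    }

print(page_text_stats("The very hungry caterpillar ate one red apple."))
```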

MVP #3

We are shooting for a single-feature MVP. A user comes to www.dhTANDEM.com and uploads a folder of image files, following the instructions printed on the screen. They are prompted to hit an “analyze” button. After a few moments, a downloadable file is generated containing OCR-ed text, key data points from the OCR-ed text, and key feature descriptors from the overall image. This is purely for an early adopter looking to generate some useful data so that they can continue working on their story and/or data visualization.
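For anyone curious about the plumbing, here is a rough sketch of that upload, analyze, download loop. Flask and the route name are assumptions made purely for illustration, not decisions about how dhTANDEM.com will actually be built; the comments mark where the OCR text, NLTK values, and image statistics would plug in.

```python
# A sketch of the glue only: upload image files, hit "analyze", get a CSV back.
import csv
import io
from flask import Flask, request, send_file

app = Flask(__name__)

@app.route("/analyze", methods=["POST"])
def analyze():
    rows = []
    for upload in request.files.getlist("pages"):
        data = upload.read()
        rows.append({
            "file": upload.filename,
            "bytes": len(data),
            # the real pipeline would add the OCR-ed text, the six NLTK values,
            # and the per-page image statistics as columns here
        })
    if not rows:
        return "no files uploaded", 400

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return send_file(
        io.BytesIO(out.getvalue().encode("utf-8")),
        mimetype="text/csv",
        as_attachment=True,
        download_name="tandem_output.csv",
    )
```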

Concerns

Overall things are going swimmingly. But of course there are concerns. This week’s concerns include:

  • Can we get OpenCV to do what we need in the time we have available? This seems to be the element that people really want — the visuality of illustrated print.
  • Will we be able to scale the project to process the number of pages users would need for the results to prove TANDEM’s value?

These are, of course, huge questions. But to put it all in perspective: Stephen Zwiebel told Kelly this week that DH Box was held together “by tape” at the time of the final project presentations, and that it has had a lot more time in the past 9 months to become stable. Not to say that we aren’t looking to have a (minimally viable) product come May, but it’s a good feeling to know where other groups were last year. Should we be sharing that widely with the class? Well, we just did. 🙂

THANKS FOR FOLLOWING @dhTANDEM!

a bicycle built by four, for two (text AND image)

 

NYC Fashion Index Weekly Update


As previously mentioned, we chose Python to work with data sets from the Instagram API.

It will be necessary to use MySQL for its geospatial libraries. For data mining, we are focusing on collecting geospatially tagged posts in NYC neighborhoods in order to analyze fashion, so a relational database is a good fit for our project. We are planning to incorporate multiple data sets, e.g., hashtags, images, links between images and hashtags, and locational information (where the picture was taken, by neighborhood).
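As a sketch of the relational layout we have in mind, the snippet below stores one geotagged post with its hashtags and runs a crude bounding-box query. SQLite stands in for MySQL here so the sketch runs anywhere, field names are illustrative only, and we assume the post records have already been pulled from the Instagram API.

```python
# Illustrative schema: posts + hashtags, linked by post id.
import sqlite3

conn = sqlite3.connect("fashion_index.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS posts (
        id TEXT PRIMARY KEY,
        image_url TEXT,
        lat REAL, lng REAL,        -- where the picture was taken
        neighborhood TEXT
    );
    CREATE TABLE IF NOT EXISTS hashtags (
        post_id TEXT REFERENCES posts(id),
        tag TEXT                   -- link between images and hashtags
    );
""")

sample = {"id": "123", "image_url": "http://example.com/p.jpg",
          "lat": 40.7216, "lng": -73.9950, "neighborhood": "Lower East Side",
          "tags": ["ootd", "nycfashion"]}

conn.execute("INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?, ?)",
             (sample["id"], sample["image_url"], sample["lat"],
              sample["lng"], sample["neighborhood"]))
conn.executemany("INSERT INTO hashtags VALUES (?, ?)",
                 [(sample["id"], t) for t in sample["tags"]])
conn.commit()

# Crude geospatial filter (bounding box); MySQL's spatial functions would
# replace this in production.
rows = conn.execute("""
    SELECT p.neighborhood, h.tag FROM posts p JOIN hashtags h ON h.post_id = p.id
    WHERE p.lat BETWEEN 40.70 AND 40.73 AND p.lng BETWEEN -74.00 AND -73.98
""").fetchall()
print(rows)
```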

Ren and Tessa met with Prof. Gold to get feedback on our project. First of all, our project has to emphasize the “analytic research angle”: it would be better to observe patterns across multiple data sets. We also discovered that crowdsourcing is unpredictable and unreliable, so we would rather lean toward curated approaches for our project.

 

Database Question

Not too long ago (in 2010, via its acquisition of Sun Microsystems) Oracle acquired MySQL, and there was, I know, a fair amount of concern within the open source community that Oracle wouldn’t support it very well–that they might even deliberately try to kill it or convert it into a profit-making product. Perhaps these concerns have come true? Does anybody have a sense of whether the open source community is moving away from MySQL, or whether Oracle has done a good job supporting this DBMS?

We have a DocumentCloud account set up now, so if you’d like me to add you to it, let me know what address to use. I definitely want the Digital HUAC folks to explore the tool, but if anyone else is interested in trying it out, by all means let me know.

Tags!

Folks,

Just a reminder to pretty please use your project tags when you post updates here.

Thanks,
Amanda

Moretti in March

Happy March DH Praxers!

I just wanted to share some information that the lovely digital fellow Erin Glass alerted me to —

Mr. Graphs, Maps, and Trees himself is in town this week!

Franco Moretti is speaking at NYU and both events are open to the public:

I cut and pasted from the NYU site:

  • Wednesday, March 4th, 6:00 p.m.
    First Wednesday Speaker Series and English Dept. Annual Goldstone Lecture: “Micromégas: The very small, the very large, and the object of Digital Humanities,” Franco Moretti (Stanford University)
    Location: Room 102, Cantor Film Center
  • Thursday, March 5th, 12:30 p.m.
    Goldstone Seminar: “Canon/Archive. Large-scale Dynamics in the Literary Field,” Franco Moretti (Stanford University)
    Please RSVP here.
    Location: The Event Space, 244 Greene St.
I am going and will report back for those who can’t make it!

-Jojo

Digital HUAC Progress Report- Outreach Plan

This past week, our team has reached out to HUAC experts to help us with our taxonomy, which needs to be finalized before we can start our development work. We have also made a lot of progress on the design and workflow front.

Below is our outreach plan, which includes things that we have already done and what we hope to do.

Objectives:

  1. Consult with subject matter and technology experts.
  2. Promote Digital HUAC to potential users, supporters, and adopters.

Target:

Objective #1- Consult with subject matter and technology experts.

  1. Historians or librarians familiar with HUAC
    • Josh Freeman
    • Blanche Cook
    • Jerry Markowitz
    • John Hayes
    • KA Cuordileone
    • Ellen Schrecker
  2. DH or technological advisors
    • Dan Cohen, historian on the Old Bailey project, now at the DPLA
    • Victoria Kwan and Jay Pinho of SCOTUS Search

Objective #2- Promote Digital HUAC to potential users, supporters, and adopters.

  1. Digital humanities scholars & programs
    • Stanford Digital Humanities
    • UCLA DH
    • DiRT Directory
    • List on HASTAC site
  2. Academics
    • American History
    • Political Science
    • Linguistics
  3. High School Educators
    • History
    • Civics
    • US Government
    • National Council of Social Studies
  4. Archives, Collections, and Libraries
    • Woodrow Wilson Cold War Archive
    • Truman Library
    • Tamiment Library
    • Harriman Institute
    • Kennan Institute
    • Davis Center at Harvard University
    • The Internet Archive (archive.org)
  5. Associations
    • American Historical Association
    • Association of American Archivists
  6. Academic journals
    • Digital Humanities Quarterly
    • Journal of Digital Humanities
    • American Communist History
  7. Blogs
    • LOC Digital Preservation blog
    • DH Now
    • DH + Lib
  8. Other related DH Projects
    • SCOTUS Search
    • Old Bailey
    • NYPL Labs
    • Digital Public Library of America

Approach

Objective #1:

Outreach started on February 19.

  1. Email referrals from Matt, Steve, Luke and Daria.
  2. Find other experts through online research.

Objective #2:

Outreach to start on March 10.

  1. Social media: Twitter (@DigitalHUAC) and Wikipedia page.
  2. Create email lists of key contacts at the organizations listed above.
  3. Prepare user/supporter-specific emails for email blast.
    • User: why this project is relevant and how it can help them with their research; what this database offers that the current state of HUAC transcripts does not.
    • Supporter: why this project is relevant to the academic community, and whether they would consider doing a write-up or linking our site from their “Resources” page (try to secure some kind of endorsement).
  4. Dates of outreach:
    • April 15- Introducing the project (website launch)
    • May 10- Introducing the APIs
    • May 19- Project finalized, with full documentation

Pitch (the “voice”):

Objective #1:

(Students working on a semester-long project, looking for guidance.)

To DH practitioners: Our project, Digital HUAC, aims to develop the House Un-American Activities Committee (HUAC) testimony transcripts into a flexible research environment by opening the transcripts for data and textual analysis. We seek to establish a working relationship with both digital humanities practitioners and HUAC experts to help advise the technological and scholarly aspects of our project more broadly, especially given that our hope is for Digital HUAC to grow and thrive past our work this semester. Our project is the first attempt to organize HUAC materials in this way, using digital humanities methodologies. We see great opportunity for collaboration with the academic community and additional academic research, as we are opening up a resource that has not previously been easily accessible and usable. We believe our efforts can help uncover new research topics, across disciplines, by utilizing DH research methods.

To historians: We are working on a semester-long project which aims to make the full text of the House Un-American Activities Committee (HUAC) testimony transcripts into a searchable online archive. Our project is the first attempt to collect and organize HUAC transcripts online in one central, searchable location. The first stage of this project is to take our sample set of 5 testimony transcripts and denote common identifiers that will be useful to researchers using the archive. These common identifiers will allow our users to search based on categories of data, as opposed to only simple word searches, giving more value to the transcripts. We have developed a list of these identifiers (also known as a controlled vocabulary), but would like a historian with a deeper working knowledge of the HUAC hearings to advise us on this list.

Going forward, we hope to establish a working relationship with HUAC experts to help advise scholarly aspects of our project more broadly, especially given that our hope is for Digital HUAC to grow and thrive past our work this semester.

Objective #2:

(Pitching to potential users.)

We are excited to present Digital HUAC, an interactive repository of the House Un-American Activities Committee (HUAC) testimonies that uses computational methods for data and textual analysis. This is the first attempt to create such a database for the HUAC transcripts, which currently are not centralized in one location, nor are they all searchable. Our aim is to develop the HUAC transcripts into a flexible research environment by giving users the tools to discern patterns, find testimonies based on categories and keywords, conduct in-depth data and textual analysis, as well as export data sets. For the beta stage of this project, we will start with five selected testimonies.

Researchers: Digital HUAC is an interactive repository that will give researchers unprecedented access to HUAC transcripts. Supported by advanced search functionality across all records and a built-in API for additional data and text analysis, Digital HUAC opens up one of the largest collections of primary source material on American Cold War history. Researchers will now be able to use the HUAC transcripts for comparative political analysis, informant visualization, social discourse analysis of court transcripts, linguistic analysis, and other research topics that have not been realized due to the previous inaccessibility of the HUAC transcripts.

High School Teachers: Digital HUAC aims to provide access to one of the most substantive collections of primary source material on American Cold War history. Your students will have the opportunity to delve further into their inquiry learning through the repository’s search functionality and PDF library of the original material. While the subject matter may be vast and complex, we have created a supportive research and learning environment with an easy-to-use interface, clean results pages, customizable options to save and export searches, and assistance with citation.

 

TANDEM: Team Project Update

TANDEM-related information now has a home on our Commons page.

Technology Notes Week 4 (via Steve)

To build TANDEM, we are utilizing a variety of existing tools. The tools are:

  1. NLTK: We plan to use NLTK to work with language data after it has been OCR’d by Tesseract. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging and parsing.
  2. FeatureExtractor (or QTip): We need a tool that will analyze the features of the images on each page of a corpus. Both of these tools are developed by the Software Studies Lab run by Lev Manovich. While FeatureExtractor is more powerful, QTip is easier to use. We have run into an issue determining the viability of redistributing FeatureExtractor, as it relies on proprietary software. We are still working out whether we can work with this. In the meantime, we have scheduled a meeting with Lev to discuss our options.
  3. Tesseract: This is an open source OCR tool provided by Google. We will use this tool to identify text elements on each page of the corpus under analysis. The output of Tesseract will be the input to NLTK for analysis. While we continue to lean on Tesseract, our outreach has yielded information about Ocular, from UC Berkeley. We have reached out to the programmers behind this alternative OCR option, and they have been receptive and helpful. We are vetting this as a possible OCR option.

NLTK

Programming

We are currently writing a Python script that will pass a directory of files to NLTK for processing.
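One possible shape for that script is sketched below; the paths and analysis choices are placeholders, not the actual prototype.

```python
# Sketch: walk a directory of OCR output files and run each through a few
# NLTK analyses, printing one summary dict per file.
import sys
from pathlib import Path
from nltk import word_tokenize, FreqDist

def analyze_file(path):
    tokens = [t.lower() for t in word_tokenize(path.read_text(encoding="utf-8"))
              if t.isalpha()]
    fdist = FreqDist(tokens)
    return {"file": path.name,
            "word_count": len(tokens),
            "unique_words": len(fdist),
            "top_words": fdist.most_common(5)}

if __name__ == "__main__":
    directory = Path(sys.argv[1] if len(sys.argv) > 1 else "ocr_output")
    for txt in sorted(directory.glob("*.txt")):
        print(analyze_file(txt))
```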

Status

The NLTK install is complete on the developer’s and PM’s machines. A prototype program has been completed and tested by the developer, and enhanced by the PM to take multiple input files and pass them to NLTK for several different analyses.

Next Steps

Enhance the script to handle a variable number of input files output from the OCR step.

______

QTip

Programming

TBD. QTip does not seem to have an API or a way to be launched from a Python script.

Status

Install complete and tested.

Next Steps

Determine if there is a way to utilize QTip from a Python program. A meeting is scheduled with Lev Manovich on March 4 to discuss this.

______

Tesseract

Programming

A prototype module has been created and tested by the developer to process a variable number of PNG files through the Tesseract OCR engine.
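For reference, a prototype along those lines might look like the sketch below, which uses the pytesseract wrapper. That wrapper is an assumption on our part; the actual module may call the tesseract binary directly.

```python
# Sketch: OCR every PNG in a folder and write one .txt file per page.
from pathlib import Path
from PIL import Image
import pytesseract

def ocr_pngs(directory="scans", out_dir="ocr_output"):
    Path(out_dir).mkdir(exist_ok=True)
    for png in sorted(Path(directory).glob("*.png")):
        text = pytesseract.image_to_string(Image.open(png))
        (Path(out_dir) / f"{png.stem}.txt").write_text(text, encoding="utf-8")
        print(f"{png.name}: {len(text.split())} words recognized")

if __name__ == "__main__":
    ocr_pngs()
```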

Status

Install complete on the developer’s computer. A prototype program has been tested that verifies the viability of using Tesseract with PNG files. A significant challenge will be to find methods to improve the poor quality of the OCR output.

Next Steps

We will focus on improving the quality of the output and on testing the OCR engine with other types of input files (TIFF, JPEG, GIF, and BMP are obvious candidates). We also continue to explore alternative OCR engines that may work better than Tesseract.

______

On the outreach front:

  • We continue to start conversations via the #picturebookshare and #tandem hashtags on Twitter.
  • Emails are currently being exchanged on the OCR and image feature extraction front to determine what best practices we can take advantage of.

______

On the design front:

Because we are focusing mainly on backend functionality, all forward-facing design work remains in the outreach department for now. We are working on playful picture-book-related logos and marketing materials, as well as maintaining a presence in the illustration community. We are not yet at the stage of developing designs for the user interface, but we are beginning to consider the types of functionality (buttons, sliders, etc.) that we will incorporate in the final tool.

Process Report CUNYcast

CUNYcast is an online experimental broadcast in the Digital Humanities. The CUNYcast site will model itself on ds106radio. It will also document the process and create a “how to” manual for future CUNYcasters. A link from the CUNYcast group page on the Academic Commons will lead people to an external site where content will be streamed. CUNYcast is a live online radio stream that anyone can take over and populate with their own DH audio radio broadcast. CUNYcast is a non-archivable broadcast that will be accessible on the web. CUNYcast’s aim is to empower a DH guerrilla broadcast community.

Our team’s goal this week was to test an audio upload to ds106radio and to begin building out our WordPress site, while documenting and reporting on our process and progress. *Note: although documentation appears here, it has not been verified between team members. Please do not attempt or post until we have completed final edits on the manual. Thanks!

Process Report 2/25/15:

Joy edited in-class audio from DH Praxis 2014-15, added music, and recorded an introduction. James’s task was to upload that content in order to better understand how ds106radio works.

  1. Using edited audio recordings of our in-class conversations, James converted an .m4a file (Advanced Audio Coding, or AAC, format) to an .mp3 file.
  2. The online converter media.io took about three minutes to convert the file, reducing its size from 28MB to 19MB. (A scripted alternative is sketched below.)
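For the how-to manual, the same conversion could also be scripted locally with pydub, which wraps ffmpeg. This is an alternative we have not tested end to end, not the media.io route we actually took:

```python
# Sketch: convert an .m4a recording to a 128 kb/s .mp3 with pydub (requires ffmpeg).
from pydub import AudioSegment

audio = AudioSegment.from_file("class_audio.m4a", format="m4a")
audio.export("class_audio.mp3", format="mp3", bitrate="128k")  # matches ds106radio's 128 kb/s stream
```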

Note: James chose 128kb/s as the quality, remembering that ds106radio has a 128kb/s stream. Next, we needed to figure out what would come first: the ds106radio how-to, Airtime, or Icecast? Airtime has a giant button on its landing page that says START NOW, so that seemed like a good place to begin. There is a 30-day free trial; otherwise it’s $9.95 a month.

Question: If we do work with ds106radio, we’ll have to get them to grant us a login, we think? Though it is also possible that when we set up our own radio station, it might cost us about $10 monthly to maintain it via Airtime.

  1. ds106radio lives on the interwebs; for how to access it via Icecast, see: http://networkeffects.ca/?p=1478
  2. Download Icecast here: http://icecast.org/download/
  3. Start Icecast. It launches a console.
  4. Follow instructions by typing the address into Chrome.

Note: If we were hosting Icecast on our local machine, this is how it would be controlled.

  1. Go to the Icecast installation directory and find a .xml doc.
  2. Open it with a text editor.

Note: This seems like it will be very important later, but we’re not sure that it will help complete the goal now. The next thing we try is looking at “Broadcasting Software” in the ds106radio how-to. We come across this document and decide to go with Mixxx, another broadcasting tool.

  1. Download Mixxx (it’s 85MB). It does audio editing, mixing, broadcasting, and recording.
  2. Enable Live Broadcasting

Note: It began importing James’s whole audio library. He loaded a song and just played with some dials. He encourages everyone to do this.

  1. Open up cmd (the command prompt) and type in the commands for installing the codec:

[Photo: James broadcasting]

Note: Watch for compatibility issues. We had the 64-bit version of Mixxx but were accidentally installing the 32-bit encoder; some folders are inaccurately named. On Macs, this process seems smoother.

  1. Load the audio for broadcast on ds106radio into Mixxx by dragging and dropping.
  2. Take the server info from ds106radio and put it into Mixxx:

  • Name: ds106rad.io
  • Server: ds106rad.io
  • Port: 8010
  • Mountpoint: live
  • Username: source
  • Password: ds106
  • Codec: mp3
  • Bitrate: 128 (or less)
  • Protocol: Icecast2
  • Stereo: Y/N

  1. Success = playing live audio from our class on ds106radio.

Process Report 2/28/15:

  1. This week Julia went to a workshop on Bootstrap (http://getbootstrap.com/). It is a framework for building a responsive website.
  2. This is our template: http://getbootstrap.com/examples/carousel/#
     Note: We have had some concerns with the constraints of WordPress. Bootstrap will afford us more freedom, although it may require a bit more work to update and change the site: more freedom, less of a fancy WordPress back end.
  3. Download Bootstrap, accessed here: http://getbootstrap.com/
  4. Using TextWrangler (a bare-bones HTML editor), she saved the document as an .html file (e.g., index.html).
  5. Place the file in the same folder on the desktop that holds the Bootstrap files.

Note: We are assuming that this series of files will be able to be uploaded to a server so they may become live. There may be a few steps missing that we’re unaware of since we’re not directly familiar with server setups.

6. Use TextWrangler to build the site; start with a blank text editor. Go to the template mentioned above (http://getbootstrap.com/examples/carousel/#), open the site in a browser, and look at the “view page source” option.

7. Copy the page source from that page and paste it into the plain text document.

Note: The CSS of this document was all whacked out at first. The file connections to the rest of the folders would be different if the files were sitting on the desktop.

  8. Go through the preliminary documentation to fix the <!DOCTYPE html> heading issues in the .html file.
  9. Here is a screenshot of the website displayed in a browser on her computer: it is bare bones, but it does display.

Note: Julia will next play with the style of the CUNYcast site to reflect the new direction of the project. Barbara Kruger is a visual inspiration, since we’re going guerrilla.

Please join us at our new Twitter account, @CUNYcast, and use the hashtag #CUNYcast.
Also, we’ll be making our group page public on the Commons this week.