Tag Archives: Digital HUAC


Digital HUAC- Project Update

This week, our team found the answer to our biggest development hurdle: DocumentCloud. Before this discovery, we were trying to figure out how to build a relational database that would store the metadata tags for our corpus and respond to user input from our website’s search form.

It turns out that DocumentCloud, with an Open Calais backend, can create semantic metadata from document uploads and pull out the entities within the text. The ability to recognize entities (places, people, organizations) is particularly helpful for our project, since these are potential search categories. We can also create customized search categories in DocumentCloud by defining key-value pairs. On Tuesday, we uploaded our 5 HUAC testimonies and started to create key-value pairs based on our taxonomy. (Earlier this week, we finalized the taxonomy after receiving feedback on it from Professor Schrecker at Yeshiva University and Professor Cuordileone at CUNY City Tech.) To create these key-value pairs, we had to read through each transcript and pull our answers, like this:

| Field | Notes & Examples | Rand | Brecht | Disney | Reagan | Seeger |
| --- | --- | --- | --- | --- | --- | --- |
| Hearing Date | year-mo-day, 2015-03-10 | 1947-10-20 | 1947-10-30 | 1947-10-24 | 1947-10-23 | 1955-08-18 |
| Congressional Session | number | 80th | 80th | 80th | 80th | 84th |
| Subject of Hearing | | Hollywood | Hollywood | Hollywood | Hollywood | Hollywood |
| Hearing location | City, 2 letter state | Washington, DC | Washington, DC | Washington, DC | Washington, DC | New York, NY |
| Witness Name | Last Name, First Middle | Rand, Ayn | Brecht, Bertolt | Disney, Walt | Reagan, Ronald W. | Seeger, Pete |
| Witness Occupation | or profession | Author | Playwright | Producer | Actor | Musician |
| Witness Organizational Affiliation | | | | Walt Disney Studios | Screen Actors Guild | People’s Songs |
| Type of Witness | Friendly or Unfriendly | Friendly | Unfriendly | Friendly | Friendly | Unfriendly |
| Result of appearance | contempt charge, blacklist, conviction | | Blacklist | | | Contempt charge, but successfully appealed; Blacklist |
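As a rough illustration of what one transcript's key-value pairs might look like once they are pulled from the rows above, here is the Seeger column written as a Python dictionary. The key names (`hearing_date`, `witness_type`, and so on) and the two helper functions are hypothetical choices made for this sketch; they are not names required by DocumentCloud.

```python
# Values are copied from the Seeger column above; the key names are hypothetical.
seeger = {
    "hearing_date": "1955-08-18",
    "congressional_session": "84th",
    "subject_of_hearing": "Hollywood",
    "hearing_location": "New York, NY",
    "witness_name": "Seeger, Pete",
    "witness_occupation": "Musician",
    "witness_affiliation": "People's Songs",
    "witness_type": "Unfriendly",
}


def to_pairs(metadata):
    """Return (key, value) tuples, skipping any field that was left blank."""
    return [(key, value) for key, value in metadata.items() if value]


def matches(metadata, **criteria):
    """Return True if every requested key holds exactly the requested value."""
    return all(metadata.get(key) == value for key, value in criteria.items())
```

A call such as `matches(seeger, witness_type="Unfriendly")` is the kind of check a category search would eventually need to perform across the whole corpus.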

With DocumentCloud thrown back into the mix, we had to take a step back and start again with site schematics. We discussed each step of how the user would move through the site, down to the click, and how the backend would work to fulfill the user input in the search form. (Thanks, Amanda!) In terms of development, we will need to create a script (Python or PHP) that will allow the user’s input in the search box to “talk” to the DocumentCloud API and pull the appropriate data.
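Here is a minimal sketch, in Python, of the kind of script described above. It only assembles a search URL from the form input and does not send a request. The endpoint URL, the `key:"value"` filter syntax, and the field names are all assumptions for illustration and would need to be checked against the DocumentCloud API documentation.

```python
from urllib.parse import urlencode

# Placeholder endpoint, NOT the real DocumentCloud URL (assumption for this sketch).
SEARCH_ENDPOINT = "https://example.org/api/search.json"


def build_query(free_text, filters):
    """Combine free text with key:"value" filters into a single query string.

    The key:"value" syntax is an assumed convention for this sketch.
    """
    parts = []
    if free_text.strip():
        parts.append(free_text.strip())
    for key, value in sorted(filters.items()):
        if value:  # ignore form fields the user left empty
            parts.append('{}:"{}"'.format(key, value))
    return " ".join(parts)


def build_search_url(free_text, filters, endpoint=SEARCH_ENDPOINT):
    """Return the full URL that the site's backend would request."""
    return endpoint + "?" + urlencode({"q": build_query(free_text, filters)})
```

Keeping the query builder separate from the URL builder means the filter syntax can be changed in one place once the real API conventions are confirmed.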


Amanda mentioned DocumentCloud to us a while ago, but our group thought it was more of a repository than a tool, so our plan was to investigate it later, after we figured out how to build a database. After we had hounded the Digital Fellows for a couple of weeks about how to create a relational database, they finally told us, “You need to look at DocumentCloud.” Moral of the story: Question what you think you know.

On the design front, we started working in Bootstrap and have been experimenting with GitHub. We were able to push a test site through GitHub Pages, but we still need to work out how to upload the rest of our site directory. This is our latest design of the site:


Digital HUAC: MVP Post

Over the course of this project so far, and in response to the feedback we have been receiving, we have scaled our goals and expectations up and down. It has been both humbling and empowering to consider everything we can do within the constraints of a single-semester project. When asked to brainstorm our minimum viable product (MVP) this week, we all agreed over a conference call on the following:

– a central repository with basic search functionality that stores our corpus of 5 transcripts.

– a database that can be scaled.

What does this mean, and how does it differ from our current project goals?

We are attempting to generate a platform that connects a relational database to a robust search interface and utilizes an API to allow users to extract data. We envision Digital HUAC to be the start of a broader effort to organize HUAC transcripts and allow researchers and educators access to their every character. By allowing advanced searches driven by keywords and categories, we seek to allow users to drill down into the text of the transcripts.

Our MVP focuses on storing the transcripts in a digital environment that returns simple search results: in the absence of a robust search mechanism, users would instead receive results indicating, for example, that a sought-after term appeared in a given transcript and not much more.

Our MVP must lay out a model for a scalable database. We are still very much figuring out exactly how our database will operate, so it is hard to fully commit to what even a pared-down version of this would look like. But we know that the MVP version must work with plain text files as the input and searchable files as the output.
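To make the "plain text in, simple results out" idea concrete, here is a minimal Python sketch. It assumes the transcripts sit in one folder as `.txt` files; the function names and the placeholder sample strings are hypothetical and exist only for illustration.

```python
from pathlib import Path


def load_transcripts(folder):
    """Read every .txt file in a folder into a {filename: text} dictionary."""
    return {p.name: p.read_text(encoding="utf-8") for p in Path(folder).glob("*.txt")}


def simple_search(term, transcripts):
    """Return the names of transcripts whose text contains the term.

    The match is case-insensitive and reports only whether the term
    appears, which is all the MVP promises.
    """
    needle = term.lower()
    return sorted(name for name, text in transcripts.items() if needle in text.lower())


# Placeholder strings stand in for the real plain text files.
sample = {
    "rand.txt": "placeholder text mentioning propaganda",
    "seeger.txt": "placeholder text mentioning songs",
}
```

With the sample above, `simple_search("songs", sample)` names only the second file, which is the kind of bare-bones result the MVP would return.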

Generating an MVP has been a useful thought experiment. It has forced us to home in on the twin narrative and technical theses of this project: essentially, if everything else were stripped away, what must be left standing? For us, this means providing basic search results and a working model of a relational database that, given appropriate time and resources, could be expanded to accommodate a much greater corpus.

Digital HUAC – Workplan & Wireframe & Update





Workplan: what & why


The documents, which are already scanned, will be manually tagged in an XML editor according to our identified categories and then read into MySQL, an open-source relational database that can read XML documents. The website will query the MySQL database through PHP embedded in the site's HTML/CSS schema. Finally, the API will allow users to export their searches to text-analysis resources.
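The tagging-then-loading step can be sketched in a few lines. This example uses Python with the built-in `sqlite3` module as a stand-in for MySQL and PHP, purely so that it runs without a database server. The XML element names, attribute names, and table columns are hypothetical and do not reflect a finalized tagging scheme.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical tagging scheme: element and attribute names are illustrative only.
SAMPLE_XML = """
<hearing date="1955-08-18">
  <witness type="Unfriendly">Seeger, Pete</witness>
  <location>New York, NY</location>
</hearing>
"""


def load_hearing(conn, xml_text):
    """Parse one tagged transcript header and insert it as a database row."""
    root = ET.fromstring(xml_text.strip())
    witness = root.find("witness")
    conn.execute(
        "INSERT INTO hearings (hearing_date, witness_name, witness_type, location) "
        "VALUES (?, ?, ?, ?)",
        (root.get("date"), witness.text, witness.get("type"), root.findtext("location")),
    )


# An in-memory SQLite database stands in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hearings (hearing_date TEXT, witness_name TEXT, "
    "witness_type TEXT, location TEXT)"
)
load_hearing(conn, SAMPLE_XML)

# The kind of query a search form might run for one category.
rows = conn.execute(
    "SELECT witness_name FROM hearings WHERE witness_type = ?", ("Unfriendly",)
).fetchall()
```

The parameterized `?` placeholders matter for a public search form, since they keep user input from being interpreted as SQL.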

Historians and Corpus

We’ve identified a number of historians, librarians and archivists, and digital humanists to potentially work with on this project and are in the process of reaching out to them in an advisory capacity. We seek guidance on our taxonomy and controlled vocabularies in the short term, and on future developments of our project beyond the scope of this semester.

At the top of this list are historians Blanche Cook and Josh Freeman, CUNY professors and experts on the HUAC era. Steve Brier is in the process of introducing us to both Cook and Freeman. Other historians include Ellen Schrecker (Yeshiva), Mary Nolan (NYU), Jonathan Zimmerman (NYU), and Victoria Phillips (Columbia), each with subject expertise and research experience on the time, events, and people central to Digital HUAC. We have also identified Peter Leonard, a DH librarian at Yale; David Gary, the American History subject specialist at Yale who holds a PhD in American History from CUNY; John Haynes, a historian who served as a specialist in 20th-century political history in the Manuscript Division of the Library of Congress; and Jim Armistead and Sam Rushay, archivists at the Truman Library, as potential advisors.

We have narrowed down the corpus of text that we’ll be working with to 5 transcripts: Bertolt Brecht, Ronald Reagan, Ayn Rand, Pete Seeger, and Walt Disney. This list of major cultural figures spans the hearings themselves and features both friendly and hostile witnesses, offering users a varied look into the nuances of interrogation. It is our opinion that by focusing on a thematically organized witness base of recognizable figures, we let users examine their testimony both as individuals and in context with one another. This quality of the HUAC hearings cannot be overstated, and Digital HUAC seeks to draw attention to it through the overall user experience.


HUAC User Stories

User Story #1: A forensic computational linguist is researching how interviewing style affects witness responses. The value of the site to this user is the ability to compare friendly and unfriendly witnesses (a distinction that is difficult to determine in general court transcripts) and the sheer number of transcripts available (also difficult to collect for general court transcripts). The user clicks the API link and follows the prompts to extract a cluster of readings from unfriendly witness testimony, then does a second export for a cluster from friendly witness testimony. The API exports the two corpora to an intermediary location (such as Zotero), where they can be used with Python (NLTK) to compare, for example, the number of times interviewers repeated a question for friendly vs. unfriendly witnesses.

User Story #2: High school civics & US history teacher, Chris. He wants to assign his students to search the archive for primary source documents from the McCarthy era. Students will have a list of topics and names to choose from as their research areas. Chris tests the site to see if it will be useful to his students. He uses the simple search box to search for topics such as ‘treason’ and ‘democracy’, and he uses the advanced search options to combine topics with names. Chris is looking for clean results pages, the option to save and export searches, and help with citation.

User Story #3: American political scientist, Jennie, doing research on US government responses during periods of perceived national security threats. Specifically, she is interested in the Foreign Intelligence Surveillance Courts (FISC), which are closed courts. Jennie wants to read now-released documents that record the conduct of closed courts, and she wants to do a mixture of qualitative and quantitative analysis. Qualitative: Jennie uses the advanced search to specify that she wants to look only at hearings categorized as ‘closed’, then does the same for ‘open’ hearings. Quantitative: Jennie uses the API and follows the prompts to extract data with the category ‘closed’ and a date filter, in order to do a statistical analysis of the number of closed hearings by year and of whether and how that number correlates with outside events.
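The quantitative step in this last story could start from something as small as the Python sketch below. It assumes each exported record carries a `hearing_date` in year-mo-day form and a `hearing_type` field; both field names are hypothetical, and the sample records are invented placeholders, not real HUAC data.

```python
from collections import Counter


def closed_hearings_by_year(records):
    """Count the records marked 'closed', grouped by the year in hearing_date."""
    return Counter(
        record["hearing_date"][:4]
        for record in records
        if record.get("hearing_type") == "closed"
    )


# Invented placeholder records for illustration only; not real HUAC data.
records = [
    {"hearing_date": "1950-01-01", "hearing_type": "open"},
    {"hearing_date": "1950-02-01", "hearing_type": "closed"},
    {"hearing_date": "1951-03-01", "hearing_type": "closed"},
    {"hearing_date": "1951-04-01", "hearing_type": "closed"},
]

counts = closed_hearings_by_year(records)
```

The per-year counts could then be lined up against a timeline of outside events to look for the correlations Jennie is interested in.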