Digital HUAC Project Plan
Team Members and Roles
- Daria V. – Developer
- Sarah C. – Designer
- Chris M. – Project Manager
- Juliana S. – Outreach Coordinator
The House Un-American Activities Committee (HUAC), a committee of the U.S. House of Representatives, was established in 1938 to investigate subversive organizations in the U.S. government. By 1945, its focus had shifted to investigating potential Communist activities. HUAC remained active until 1975.
The HUAC records, tens of thousands of pages from hundreds of hearings, represent one of the largest and most substantive documentary collections in twentieth-century American history, providing primary source material on American Cold War history and a unique account of government responses to real and perceived political dissent at key moments.
Currently, the HUAC records are difficult to locate, and only a few of the transcripts have been digitized, making it difficult to read individual hearings or locate specific people, much less search for meaningful patterns. The potential of the HUAC records for research interests such as comparative historical political analysis and forensic computational linguistics remains untapped.
The Digital HUAC project aims to develop the HUAC transcripts as a flexible learning environment for future digital humanities projects by opening the transcripts for data and textual analysis. This project intends to create a keyword-enabled, fully searchable online version of a small sampling of HUAC testimony and an API to export datasets for further data, textual, and visual analysis. The final product will be easily scalable to eventually encompass the complete HUAC records.
What problem does this solve?
We believe the HUAC material has a wide academic audience and substantial cultural significance. In their current form, however, the records are difficult to locate, much less use, which limits the range of methods and research questions that can be brought to bear on them. Some, though not all, hearing transcripts have been digitized and are available as PDFs, a format that makes complex searching and text analysis difficult. This project would remediate these shortcomings. As it stands, historians are the primary users of the HUAC transcripts, both physical and digital. We seek to bring the transcripts to this audience, as well as to researchers from other fields and interested members of the general public, through a user-friendly interface paired with a robust export mechanism.
What lacuna does it fill?
The HUAC records lend themselves well to textual analysis. Beyond the sheer scope of the material (almost forty years of hearings and reports), the committee's constantly shifting focus in response to current events makes the records a useful corpus for visualizing trends over time. The dynamic qualities of the transcripts, such as variations in speakers and speech, make them of interest to more than just historians. Disciplines as diverse as forensic computational linguistics, American Studies, and media studies will have an interest in examining HUAC transcripts. Such cross-disciplinary applications reinforce the value this project holds for the wider DH community. With this project, researchers will be able to easily search the transcripts and export their results, addressing the need for these texts to be available for data and textual analysis.
What similar projects are there?
Currently there are no other projects working with HUAC transcripts. Some, though not all, hearing transcripts have been digitized and are available as PDFs from a variety of archival locations online.
Conceptually, the Old Bailey Project, which created a searchable digital repository and tool for querying trial reports published in the Old Bailey Proceedings from 1674 to 1913, is similar to this digital HUAC project.
Other projects have used firsthand accounts of politicians and government officials to shed light on aspects of diplomatic history, allowing researchers to analyze rather than intuit the rationale behind the enactment of important political events. The Columbia Rule of Law Oral History Project is one. Micki Kaufman’s Quantifying Kissinger project, which uses diplomatic archives for the purpose of computational analysis, is an example of the type of distant reading research that could be produced from this project’s material.
The anticipated technical requirements for this project are: (1) TEI (Text Encoding Initiative)-compliant XML markup; (2) search query indexing using SQL/MySQL; (3) website development (HTML/CSS/JS); and (4) API development. We intend to start with transcripts that are already in plain text format; no OCR technology will be needed at this time.
- Which of these are known?
- Basic website development & HTML
- Basic XML markup
- Which need to be learned?
- API development: we already have a working knowledge of Python for API development, but may need to consult outside resources from time to time.
- What’s the plan to learn them? What support is needed?
- Lynda.com, Codecademy, DH forums, and other online resources will be used for basic training and questions
- GC Digital Fellows will also be utilized for questions
- Old Bailey process notes and white papers will be consulted
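As a first step toward the planned export API, here is a minimal sketch of the kind of endpoint we have in mind, written as a plain WSGI application so it runs on standard Python with no frameworks. The record fields and sample data are placeholders, not actual HUAC data; the real version would pull from the MySQL index.

```python
import json
from urllib.parse import parse_qs

# Placeholder in-memory data standing in for the MySQL-backed index.
# Field names follow the project's planned tag set.
RECORDS = [
    {"session": "1", "witness_name": "Jane Doe", "subject": "Sample hearing"},
    {"session": "2", "witness_name": "John Roe", "subject": "Another hearing"},
]

def app(environ, start_response):
    """Minimal WSGI endpoint: ?witness_name=... returns matching records as JSON.

    Any query-string field is treated as a filter; no parameters returns
    the full (sample) dataset.
    """
    params = parse_qs(environ.get("QUERY_STRING", ""))
    results = RECORDS
    for field, values in params.items():
        results = [r for r in results if r.get(field) in values]
    body = json.dumps(results).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

For local testing this can be served with `wsgiref.simple_server.make_server("", 8000, app)` from the standard library.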
The tools we will need for each step, from processing to presentation, are as follows:
- Gathering: taking plain text transcripts openly available on the web; no specific tools needed
- Parsing and analyzing: XML markup will be done manually; no specific tools needed
- Storing: Reclaim Hosting site, DocumentCloud
- Retrieval: SQL queries, Saxon to create tab delimited data files, MySQL for indexing
- Display: Reclaim Hosting site
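To illustrate the retrieval step, the sketch below flattens marked-up XML into tab-delimited files suitable for MySQL indexing, using only the Python standard library as a lightweight alternative to Saxon. The element names are provisional placeholders mirroring our planned tags, not a finalized schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Placeholder markup: element names mirror the planned tag set,
# not a finalized schema.
SAMPLE = """\
<hearing>
  <session>1</session>
  <subject>Sample subject</subject>
  <witness>
    <name>Jane Doe</name>
    <hometown>Anywhere</hometown>
    <occupation>Writer</occupation>
  </witness>
</hearing>"""

FIELDS = ["session", "subject", "witness_name", "witness_hometown", "witness_occupation"]

def hearing_to_rows(xml_text):
    """Flatten one marked-up hearing into dicts, one per witness."""
    root = ET.fromstring(xml_text)
    session = root.findtext("session", default="")
    subject = root.findtext("subject", default="")
    rows = []
    for w in root.findall("witness"):
        rows.append({
            "session": session,
            "subject": subject,
            "witness_name": w.findtext("name", default=""),
            "witness_hometown": w.findtext("hometown", default=""),
            "witness_occupation": w.findtext("occupation", default=""),
        })
    return rows

def rows_to_tsv(rows):
    """Serialize rows as tab-delimited text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

The resulting TSV files could then be bulk-loaded into MySQL for indexing.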
The XML markup on our selected documents will be generated manually by the group members. The digitized text will be searchable for any character string, but to facilitate structured searching and the generation of statistics, the text will also be marked up in XML. The following tags have been identified for the markup. The choice of tags was driven by the material and by which searches would generate the most useful results, both for data mining and for basic searches.
- Session number
- Subject of hearing
- Investigator name
- Investigator state
- Investigator political party
- Investigator gender
- Witness name
- Witness hometown
- Witness occupation
- Witness gender
- Reason for appearance
- Type of defendant (friendly, unfriendly)
- Result of appearance (contempt charge, blacklist, conviction, etc.)
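For illustration, a single hearing might be marked up roughly as follows. The element and attribute names are provisional placeholders drawn from the tag list above, with bracketed placeholder values; a finalized schema would be brought into line with TEI guidelines.

```xml
<!-- Illustrative only: names mirror the tag list above, not a finalized schema. -->
<hearing>
  <session>[session number]</session>
  <subject>[subject of hearing]</subject>
  <investigator state="[state]" party="[party]" gender="[gender]">
    <name>[investigator name]</name>
  </investigator>
  <witness gender="[gender]" type="[friendly | unfriendly]">
    <name>[witness name]</name>
    <hometown>[witness hometown]</hometown>
    <occupation>[witness occupation]</occupation>
    <reason>[reason for appearance]</reason>
    <result>[contempt charge | blacklist | conviction | ...]</result>
  </witness>
</hearing>
```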
Additionally, we would like to be able to display original page images along with the transcribed text. Since these PDFs are readily available, we would need to upload them to our site. How this is done will be determined by how much space we have on our Reclaim Hosting site. Other possibilities include storing the PDFs on DocumentCloud.
How will the project be managed?
Our team will use Google Drive as a central repository to hold all of our project documents, including research notes, reading and resource lists, project plan, project updates for the blog, master task list with deadlines and responsibility initials, and any other work product.
| Milestone | Target date |
| --- | --- |
| XML markup completed | 3/15/2015 |
| SQL indexing completed | 3/20/2015 |
| Web interface built | 4/15/2015 |
| Search testing phase | 4/15-4/30/2015 |
| Project completed (website launched, white paper submitted) | 5/19/2015 |
This is good progress. A few points:
– I’d encourage you to find a historian of anti-communism with whom to talk about your taxonomy. It looks good to me, but there may be some categories you’re leaving out. For instance, did the HUAC committee travel (I think the Kefauver commission did)? If it did, then location of hearing might be necessary. Was the hearing televised? I’m blanking on whether there is a historian at CUNY to consult about this; perhaps Josh Freeman, or Blanche Cooke? Email Steve about this asap.
– I’d like to see an index of what records are where and in what format for the full run of the committee
– The plan to learn the technologies you’ll ultimately need as presented here is currently insufficient, and its development is not something you can put off. It needs to happen at the same time the markup is happening. We’ll discuss in class tomorrow.
I’ll be in touch with Reclaim Hosting this week about getting you guys set up, but may need to confirm that the packages you’ll need will be available. You’ll also need to get me a ballpark for your storage needs (the index above will help).
Adding to Luke’s response here, I’d like to see you looking at other document archives, beyond just Old Bailey, for inspiration and guidance.
And since this is very much a *story* project (as opposed to a tool project) we’re going to be looking to you for clear documentation and guidance for future archivists about how you handled the archive.
Sorry I was blanking on this earlier, but the full-text search engine I was thinking of is called Solr.
for a modest sample set, MySQL or Postgres is not a bad option. I’d advise against SQLite at this stage only because you don’t have enough of a handle on what you’re building and what you need it to do, from a technical perspective, to even ask the right questions.
My sense is that you’re not doing a lot of writing *to* the database, just pulling from it, in which case you don’t need a database that is optimized for concurrent transactions (i.e., two different processes trying to edit or add to the database at the same time), but that would be one consideration. I’m looking forward to seeing your schematics so I can help you guys plan this project out.
This reminded me of your project: http://www.scotussearch.com/