Author Archives: JULIANA SON

Digital HUAC- Project Update

This week, our team found the answer to our biggest development hurdle: DocumentCloud. Before this discovery, we were trying to figure out how to build a relational database that would store metadata tags for our corpus and respond to user input in our website’s search form.

It turns out that DocumentCloud, with an OpenCalais backend, can generate semantic metadata from document uploads and extract the entities within the text. The ability to recognize entities (places, people, organizations) is particularly helpful for our project, since these would be potential search categories. We can also create customized search categories in DocumentCloud by defining key-value pairs. On Tuesday, we uploaded our 5 HUAC testimonies and started to create key-value pairs based on our taxonomy. (Earlier this week, we finalized the taxonomy after receiving feedback from Professor Schrecker at Yeshiva University and Professor Cuordileone at CUNY City Tech.) To create these key-value pairs, we had to read through each transcript and pull out the answers, like this:

| Field | Notes & Examples | Rand | Brecht | Disney | Reagan | Seeger |
| --- | --- | --- | --- | --- | --- | --- |
| Hearing Date | year-mo-day, e.g. 2015-03-10 | 1947-10-20 | 1947-10-30 | 1947-10-24 | 1947-10-23 | 1955-08-18 |
| Congressional Session | number | 80th | 80th | 80th | 80th | 84th |
| Subject of Hearing | | Hollywood | Hollywood | Hollywood | Hollywood | Hollywood |
| Hearing Location | City, 2-letter state | Washington, DC | Washington, DC | Washington, DC | Washington, DC | New York, NY |
| Witness Name | Last Name, First Middle | Rand, Ayn | Brecht, Bertolt | Disney, Walt | Reagan, Ronald W. | Seeger, Pete |
| Witness Occupation | or profession | Author | Playwright | Producer | Actor | Musician |
| Witness Organizational Affiliation | | | | Walt Disney Studios | Screen Actors Guild | People’s Songs |
| Type of Witness | Friendly or Unfriendly | Friendly | Unfriendly | Friendly | Friendly | Unfriendly |
| Result of Appearance | contempt charge, blacklist, conviction | | Blacklist | | | Contempt charge, but successfully appealed; blacklist |
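As a rough sketch of how one testimony’s key-value pairs might be represented programmatically (the lowercase, underscored key spellings here are our own illustrative choice, not required DocumentCloud names):

```python
# Key-value pairs for one testimony, as we might attach them to a
# DocumentCloud upload. Field names follow our taxonomy; the exact key
# spellings are hypothetical.
seeger_metadata = {
    "hearing_date": "1955-08-18",
    "congressional_session": "84th",
    "subject_of_hearing": "Hollywood",
    "hearing_location": "New York, NY",
    "witness_name": "Seeger, Pete",
    "witness_occupation": "Musician",
    "organizational_affiliation": "People's Songs",
    "type_of_witness": "Unfriendly",
}

# A key:value search string built from these pairs, in the style
# DocumentCloud uses for searching custom data fields.
query = 'type_of_witness:{} hearing_location:"{}"'.format(
    seeger_metadata["type_of_witness"], seeger_metadata["hearing_location"]
)
```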

With DocumentCloud thrown back into the mix, we had to take a step back and start again with site schematics. We discussed each step of how the user would move through the site, down to the click, and how the backend would work to fulfill the user input in the search form. (Thanks, Amanda!) In terms of development, we will need to create a script (Python or PHP) that will allow the user’s input in the search box to “talk” to the DocumentCloud API and pull the appropriate data.
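A minimal sketch of that glue script in Python, assuming DocumentCloud’s public `search.json` endpoint; the taxonomy key names in the example input are our own hypothetical choices:

```python
# Turn a user's form input (a dict of taxonomy field -> value) into a
# DocumentCloud API search URL. Building the URL only -- the actual
# request and response handling would come on top of this.
import urllib.parse

SEARCH_URL = "https://www.documentcloud.org/api/search.json"

def build_search_url(fields):
    """Build a DocumentCloud search URL from {key: value} form input."""
    terms = []
    for key, value in sorted(fields.items()):
        # Quote multi-word values so they search as a phrase.
        if " " in value:
            value = '"{}"'.format(value)
        terms.append("{}:{}".format(key, value))
    query = urllib.parse.urlencode({"q": " ".join(terms)})
    return "{}?{}".format(SEARCH_URL, query)

url = build_search_url({"type_of_witness": "Unfriendly",
                        "hearing_location": "New York, NY"})
```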


Amanda mentioned DocumentCloud to us a while ago, but our group thought it was more of a repository than a tool, so our plan was to investigate it later, after we figured out how to build a database. After hounding the Digital Fellows for the past couple of weeks on how to create a relational database, they finally told us, “You need to look at DocumentCloud.” Moral of the story: Question what you think you know.

On the design front, we started working in Bootstrap and have been experimenting with GitHub. We were able to push a test site through GitHub Pages, but we still need to work out how to upload the rest of our site directory. This is our latest design of the site:


Digital HUAC Progress Report- Outreach Plan

This past week, our team reached out to HUAC experts to help us with our taxonomy, which needs to be finalized before we can start development work. We have also made a lot of progress on the design and workflow front.

Below is our outreach plan, which includes things that we have already done and what we hope to do.


  1. Consult with subject matter and technology experts.
  2. Promote Digital HUAC to potential users, supporters, and adopters.


Objective #1- Consult with subject matter and technology experts.

  1. Historians or librarians familiar with HUAC
    • Josh Freeman
    • Blanche Cook
    • Jerry Markowitz
    • John Hayes
    • KA Cuordileone
    • Ellen Schrecker
  2. DH or technological advisors
    • Dan Cohen, historian on the Old Bailey project, now at the DPLA
    • Victoria Kwan and Jay Pinho of SCOTUS Search

Objective #2- Promote Digital HUAC to potential users, supporters, and adopters.

  1. Digital humanities scholars & programs
    • Stanford Digital Humanities
    • UCLA DH
    • DiRT Directory
    • List on HASTAC site
  2. Academics
    • American History
    • Political Science
    • Linguistics
  3. High School Educators
    • History
    • Civics
    • US Government
    • National Council of Social Studies
  4. Archives, Collections, and Libraries
    • Woodrow Wilson Cold War Archive
    • Truman Library
    • Tamiment Library
    • Harriman Institute
    • Kennan Institute
    • Davis Center at Harvard University
    • The Internet Archive (archive.org)
  5. Associations
    • American Historical Association
    • Society of American Archivists
  6. Academic journals
    • Digital Humanities Quarterly
    • Journal of Digital Humanities
    • American Communist History
  7. Blogs
    • LOC Digital Preservation blog
    • DH Now
    • DH + Lib
  8. Other related DH Projects
    • SCOTUS Search
    • Old Bailey
    • NYPL Labs
    • Digital Public Library of America


Objective #1:

Outreach started on February 19.

  1. Email referrals from Matt, Steve, Luke and Daria.
  2. Find other experts through online research.

Objective #2:

Outreach to start on March 10.

  1. Social media: Twitter (@DigitalHUAC) and Wikipedia page.
  2. Create email lists of key contacts at the organizations listed above.
  3. Prepare user- and supporter-specific emails for the email blast.
    • Users: why this project is relevant and how it can help them with their research; what this database offers that the current state of HUAC transcripts does not.
    • Supporters: why this project is relevant to the academic community, and whether they would consider doing a write-up or linking our site from their “Resources” page (try to secure some kind of endorsement).
  4. Dates of outreach:
    • April 15- Introducing the project (website launch)
    • May 10- Introducing the APIs
    • May 19- Project finalized, with full documentation

Pitch (the “voice”):

Objective #1:

(Students working on a semester-long project, looking for guidance.)

To DH practitioners: Our project, Digital HUAC, aims to develop the House Un-American Activities Committee (HUAC) testimony transcripts into a flexible research environment by opening the transcripts to data and textual analysis. We seek to establish working relationships with both digital humanities practitioners and HUAC experts who can advise on the technological and scholarly aspects of our project, especially given our hope that Digital HUAC will grow and thrive past our work this semester. Our project is the first attempt to organize HUAC materials in this way, using digital humanities methodologies. We see great opportunity for collaboration with the academic community and for additional academic research, as we are opening up a resource that has not previously been easily accessible and usable. We believe our efforts can help uncover new research topics across disciplines by applying DH research methods.

To historians: We are working on a semester-long project that aims to make the full text of the House Un-American Activities Committee (HUAC) testimony transcripts into a searchable online archive. Our project is the first attempt to collect and organize HUAC transcripts online in one central, searchable location. The first stage of this project is to take our sample set of 5 testimony transcripts and denote common identifiers that will be useful to researchers using the archive. These common identifiers will allow our users to search based on categories of data, as opposed to simple word searches alone, giving more value to the transcripts. We have developed a list of these identifiers (also known as a controlled vocabulary), but we would like a historian with a deeper working knowledge of the HUAC hearings to advise us on this list.

Going forward, we hope to establish a working relationship with HUAC experts to help advise scholarly aspects of our project more broadly, especially given that our hope is for Digital HUAC to grow and thrive past our work this semester.

Objective #2:

(Pitching to potential users.)

We are excited to present Digital HUAC, an interactive repository of the House Un-American Activities Committee (HUAC) testimonies that uses computational methods for data and textual analysis. This is the first attempt to create such a database for the HUAC transcripts, which currently are not centralized in one location, nor are they all searchable. Our aim is to develop the HUAC transcripts into a flexible research environment by giving users the tools to discern patterns, find testimonies based on categories and keywords, conduct in-depth data and textual analysis, as well as export data sets. For the beta stage of this project, we will start with five selected testimonies.

Researchers: Digital HUAC is an interactive repository that will give researchers unprecedented access to HUAC transcripts. With advanced search functionality across all records and a built-in API for additional data and text analysis, Digital HUAC opens up one of the largest collections of primary source material on American Cold War history. Researchers will now be able to use the HUAC transcripts for comparative political analysis, informant visualization, social discourse analysis of court transcripts, linguistic analysis, and other research topics that have not been pursued due to the previous inaccessibility of the HUAC transcripts.

High School Teachers: Digital HUAC aims to provide access to one of the most substantive collections of primary source material on American Cold War history. Your students will have the opportunity to delve further into inquiry-based learning through the repository’s search functionality and PDF library of the original material. While the subject matter may be vast and complex, we have created a supportive research and learning environment with an easy-to-use interface, clean results pages, customizable options to save and export searches, and assistance with citation.


#skillset – Juliana

Hey all,

Good to see you guys again. Here are my skill sets:

  • Outreach Coordinator: My professional life includes 6 years of business development, marketing, and public relations. What I’m particularly good at is finding new areas of expansion by identifying potential clients and partnerships and seeing how we can enter a mutually beneficial relationship. In friendlier DH-speak, I can find collaboration opportunities that help our project reach further than it could on its own: finding a like-minded organization that can link to our project site, sending project updates to publications (not just academic ones), inviting fellow DH’ers during the beta testing phases, or exchanging data, ideas, or code with others to build up a network. I think an outreach coordinator should go beyond telling the world about our project and also attempt to build and actively engage with a community, much like a brand. My everyday job is asking and answering questions like: Why would anyone care? How does this help other people? Why should they share our story?
  • Project Manager: I’m organized, deadline-oriented, and I pick things up quickly, which is important when you are collaborating across functions. I may not be the best developer, but I trust my ability to pick up the necessary details so I can talk about them intelligently. I’m good at stepping back and thinking about how we can make better connections so all the functions work at optimal levels, like coordinating outreach around important development stages. I’m also willing and able to help other team members if they need additional help.
  • Designer/UX: I have opinions on what looks nice and my own preferences on user experience, but my skill set in this area is limited. I’m not proficient in any of the programs that others have named, though I am willing to learn.
  • Developer: I would love to develop this skill set more. I’m taking the module on HTML and CSS at the J-School this month. Depending on the project’s technical requirements, I would be open to it.

Big Data and the museum

Great job on the presentations, everyone! Really interesting stuff– and so diverse in topics and approaches.

I wanted to share this article that I just read in The Wall Street Journal:  http://www.wsj.com/articles/when-the-art-is-watching-you-1418338759

The article discusses the use of visitor-tracking data in museums to help make curatorial decisions. We’ve been seeing this a lot lately: using technology to track what is popular in order to reproduce it. It makes sense in terms of profit, but it really doesn’t leave much room for creativity and the artistic spirit, which tends to be counterculture before becoming mainstream.

NYPL Labs visit

Yesterday’s visit by NYPL Labs was inspiring. Much of what we discussed had come up earlier in the semester, but it was refreshing to hear it from non-academics: DH practitioners with a passionate and playful tone (though still obviously knowledgeable) that wasn’t over-analyzed, over-intellectualized, or rehearsed (that’s not to say that our previous guests were). Josh (?) was almost poetic in describing how they aimed to “breathe life into the collection” and save it from being “frozen in amber.”

As I mentioned in class, I’m proposing a project to digitize a series of installations curated by the APA Institute at NYU. I’ve been tackling some methodological and theoretical issues that we luckily addressed: mostly the original consumption of the archive, the observer’s experience of serendipity, and how to address what is not represented.

  • What was the original intended consumption of the archived object, and how do we translate it into something native to digital? Johanna Drucker addressed this in her critique of eBooks, which “often mimics the most kitsch elements of book iconography”; in doing so we only simulate “the way a book looks” (Drucker, 2008, 216-217) without thinking about how the book is used and how we can extend that kind of thinking to the digital environment. NYPL Labs had a creative take on this question with their 3D images site, http://stereo.nypl.org/.
  • How do we recreate the experience of accidental discovery, or serendipity, in the digital space? During Kathleen Fitzpatrick’s visit, she spoke about the technical side of this: collecting metadata and tagging. In Planned Obsolescence, she delves more into the structure of the original material and the digital environment, going beyond the ink-to-pixel conversion. The NYPL Labs guys echoed the same notion about structure: a serendipitous discovery is surprising but not random, because the data belongs to a structure and it is transparent how you arrived at your discovery. But they also questioned whether this recreation of serendipity is within the power of the creator.
  • Stating the limitations of your project. Like scientists, we should state the boundaries of our experiments, noting what was specifically included and excluded so it is not assumed that the results reflect all data (whatever that means). In the world of Google and Wikipedia, we need to be mindful of the constant creation and revision of knowledge. Even with tools for data scraping, we still need to question what is being left out and why.

They shared some great links. Here are some that I noted in case you wanted to revisit:

Drucker, Johanna. 2008. “The Virtual Codex from Page Space to E-space.” In A Companion to Digital Literary Studies, ed. Susan Schreibman and Ray Siemens, 216-32. Oxford: Blackwell.

The Freedom to Move

José Palazon/Reuters
African asylum seekers stuck on a razor wire fence, behind white-clad golfers teeing off on a golf course.

I started this data set project a couple of weeks ago, but it has taken me a while to post and share. I think I was looking to show something “more”: more complete, more thought out, more visually appealing, more theoretically/methodologically sound, etc. Then I realized that “more” probably won’t come if the project is sitting idle in Tableau, unseen by others. While I love the idea of collaborating, I’m still getting used to sharing incomplete things in process. My lack of technological savvy also had me guarded for a while. Luckily, seeing everyone’s impressive progress and feedback on each other’s work has given me the courage to just share what I’ve done already.

In my income and inequality class, a great class filled with data, Professor Milanovic pointed me to a study on travel freedom, the Henley & Partners Visa Restrictions Index. I haven’t had the chance to do additional research on this for-profit organization, which I would do if I were to pursue this further. The survey received a substantial amount of media attention, and Wikipedia based its “Visa requirements for United States citizens” page on it. I mention the media and wiki attention not to verify the survey’s reliability but to show its influence. In Prof. Milanovic’s class, we were looking at migration as a means of addressing global inequality (the income inequality between countries), but travel restrictions persist in many of the areas of the world that would benefit the most economically from migration. To make this connection clearer, I matched each country’s travel freedom ranking with its 2013 GDP, as reported by the World Bank. Several other organizations, such as the UN, IMF, and CIA, track GDP, but I decided to go with the World Bank’s numbers.

The Visa Restrictions Index was in a PDF with multiple columns, which required some text cleaning after pasting into a text editor, Notepad++. I had to learn about regular expressions to do this efficiently. It was pretty simple: I had to replace SPACE with TAB, but I also had to keep in mind multi-word nations like Central African Republic or United States. It wasn’t a clean find-and-replace in the end, so I had to fix some things manually. But overall, it was a great tip I picked up in Micki’s data visualization class.
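As a minimal sketch of that cleanup step using Python’s `re` module (the sample lines are illustrative stand-ins for the pasted index): anchoring digits at both ends of the line lets a greedy middle group keep multi-word country names intact.

```python
import re

# Hypothetical lines pasted from the PDF: rank, country name (possibly
# multi-word), and visa-free score, separated by single spaces.
lines = [
    "1 United Kingdom 173",
    "5 United States 172",
    "41 Central African Republic 105",
]

# The digits anchored at start and end force the middle group to absorb
# everything in between, so "Central African Republic" stays whole.
pattern = re.compile(r"^(\d+) (.+) (\d+)$")
rows = [pattern.sub(r"\1\t\2\t\3", line) for line in lines]
```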

I then saved the file as a CSV and created another column for GDP data from the World Bank’s data repository, which is a great source of information. I downloaded a spreadsheet that included the GDP of each country for every year since 1980; I was interested in the latest, 2013 data. Inserting this information required some more manual work. I’m sure I could have used an Excel function, but after spending some time looking for it, my impatience got the best of me and I decided to do it the not-so-quick and dirty way: I copied and pasted after putting the countries in alphabetical order. For the most part the naming conventions were the same, so it didn’t take very long. If I were to do this again, I would definitely figure out how to do this correctly, but I didn’t want to lose my momentum.
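For reference, here is roughly what that lookup could look like in Python instead of Excel. The file contents below are illustrative stand-ins, not the real Henley & Partners or World Bank data.

```python
import csv
import io

# Stand-in file contents: the visa-index rows and a World Bank-style
# country-to-GDP table (2013, current US$).
visa_csv = "country,rank\nUnited States,1\nAfghanistan,94\n"
gdp_csv = "country,gdp_2013\nUnited States,16768100000000\nAfghanistan,20561069558\n"

# Index GDP by country name for constant-time lookup.
gdp_by_country = {row["country"]: row["gdp_2013"]
                  for row in csv.DictReader(io.StringIO(gdp_csv))}

merged = []
for row in csv.DictReader(io.StringIO(visa_csv)):
    # Names that don't match exactly (e.g. "Korea, Rep.") would still
    # need manual cleanup, just as they did in the spreadsheet.
    row["gdp_2013"] = gdp_by_country.get(row["country"], "")
    merged.append(row)
```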

So my data looks like this:

After combining and cleaning the Henley & Partners Visa Restrictions Index data and World Bank 2013 GDP data


I decided to use Tableau to visualize this data. I wanted to highlight the geographical aspect of this data, as we are talking about visa and travel freedom. I thought it would be interesting to see where the clusters of countries with the highest travel freedom are in comparison to countries with the lowest travel freedom. I didn’t know how to show GDP simultaneously besides showing up in the bubble when you hover over the countries. Here is a snapshot of the map. You can go here for the “interactive” version, where you will see the GDP information.

You will notice that the countries with high GDP have the fewest travel restrictions. The countries with the lowest GDP, whose citizens would benefit the most from migration by taking available jobs and escaping political corruption at home, have the least travel freedom. So one could conclude that the current system of immigration is counterproductive in addressing global inequality.

The field of economics, as one would expect, is extremely data-heavy. Our professor would leave his Stata code at the bottom of his slides in case we wanted to recreate his results. As a non-economics major, the hard numbers and algorithms made me a little nervous, but I was also excited to see these kinds of sociological patterns. Visualizing these patterns is important because of their potential to raise awareness and spur political activism. Letting them stay hidden in industry publications or esoteric economics conferences won’t do much good, but publicizing and presenting them in ways that grab people’s attention might. The media has really been flexing its data visualization muscles recently. That being said, I was really happy to hear about the GC’s potential course crossovers with the Journalism School. It might give me a chance to keep pursuing this data project.