Tag Archives: Project Update

TANDEM Project Update

PROJECT:

TANDEM 0.5 will be moving from its heavy development phase into a testing and forward-facing design phase this week. At the time of this posting, Steve and Chris are still working out the specifics of functioning unified code, but testing of the independent scripts has begun with a certain degree of success. Text and image values are easily generated via independent processes.

This week we also discussed the idea of data persistence in some depth. Simply put, would someone be able to access the data they generated at a later date via the TANDEM UI? At this iteration of the software, we agree that this is a valuable component, but not an essential feature for an MVP. That said, we are thinking about both the code needed to run it and the user-specific UI that would accompany such an application.

DEVELOPMENT:

We are working away at unifying TANDEM’s independently functioning image and text codebases. We are aligning the code in a single python script file. We vetted an idea to have two scripts, one for image and one for text, run simultaneously. The decision was made that for this first iteration of TANDEM, a single .py will suffice, and in fact may be more maintainable and more efficient.

The code merge has been slow due to Python versioning issues, which lead to the code producing different results on different machines.

A call is scheduled for Tuesday with Tim at Reclaim Hosting to work on configuring the server to run Django. Meanwhile the developer is working through the very thorough Django tutorial and also trying to begin defining the appropriate class objects for a potential future version.

DESIGN:

Immediately following the code merge, we plan to begin implementing our user interface. A full size mockup of this is still under version control as we explore new grounds with user-specific views and the possibility of in-browser table views of the .csv data that is generated.

OUTREACH:

TANDEM continues to reach new communities. We have a lead, thanks to Sarah Cohn. The Biodiversity Library is currently crowdsourcing their seed catalog archive project, and in advanced versions, TANDEM might improve their information collection. http://blog.biodiversitylibrary.org/2015/03/help-us-improve-access-to-seed-and.html Jojo will contact them once our prototype is more stable.

Additionally, Jojo spoke with Grant Wythoff, who reasserted TANDEM’s relevance to Bill Gleason’s project at the Cotsen Library. Jojo will reach out to Professor Gleason again this week. Grant also recommended we contact Natalie Houston at University of Houston regarding her Digital Victorian project on the visuality of poetry.

Jojo also attended the DjangoGirlsNYC event at the Stack Exchange. In addition to familiarizing herself with the framework in which TANDEM will eventually operate, she made useful contact with other django developers working in NYC.

DigitalHUAC Project Update

We’re presenting this week, so we don’t want to give away too much in this post. In short, however, this week was taxing, productive, gratifying, and exciting—in that order.

After deciding, with strong encouragement, to use Document Cloud (DC) as our database and corpus warehouse, we spent the last few weeks working out how our site would talk to the DC API. No easy thing. What language would be used? How would the syntax work? Wrappers? Formatters? JSON? Above all, what are the relationships between the project’s front- and back-ends, and how does each programming decision/requirement bear these out? This involved much due diligence and scaling of learning curves. No matter: our team is made up of furious warriors, eager to storm the scene. We eat learning curves for breakfast and then throw the dishes out the damn window.

After being advised to lean on PHP (or, as we call it on conference calls, “Playas Hatin’ Python”) for our scripting needs, Daria spent the week shifting her programming language focus. This meant tying together guides and help from a number of sources—our professors, online learning hubs, GC Digital Fellows, outside gurus—drafting code, and testing it out with the rest of the group weighing in and helping troubleshoot. Our main focus this week was on connecting the search form that Juliana and Sarah put together earlier with our DC documents. We needed to be able to run a simple search—at the very least, to have a search form lead to a URL that would display the action that had just been carried out. This, we are happy to report, we have done.

Going back to the point about how the front- and back-ends interact, we also thought this week about what our search results pages would look like. This is of course important from a user experience aspect in terms of display and site navigability: indeed, part of the mission of the project is to organize scattered materials and make transparent an episode from American History shrouded in misinformation and melodrama. But there are just as many meaningful development calls as there are design decisions, and the further we get into the project, the better sense we’re getting for how these two are inextricably linked. We’ll talk more about this on Tuesday, but it’s nice to be at a point where we can think holistically about development and design. In a sense, this is where we started the project; the difference is that now we’re zeroing in on functionality, whereas at the beginning, we thought in terms of concepts.

Also, we received our Reclaim Hosting server space from Tim Owens and are now live on digitalhuac.com. We have yet to install Git on this server, so we are left to update our site like geezers: GitHub > Local Download > Filezilla. We hope to fix this soon and step up our command line game.

week 6 project update / TANDEM

Development

On the image processing side of things, Chris has identified the syntax for generating our key values. Now we are working toward stitching the pieces together in a way that makes sense for our output. The extreme minimum of computer vision is accessible via OpenCV and while the possibilities are tantalizing, we have continued to keep a direct focus on the key pieces we need to access for the mvp. TANDEM is still on track.

We have also begun to reevaluate our progress. To do so, we created a new list of dev tasks that range from bite-sized to larger steps so we can visualize how much further we have to go. Steve has been doing a great job of keeping track of progress and using git for version control of his scripts.

In addition we successfully implemented a routine to convert PDF to TXT. Input files are screened by type. If they are JPG, PNG or TIFF, they are passed to Tesseract for OCR processing. If they are PDF they are passed to a PDFMiner routine that extracts the text. In each case the program writes TXT files to “nltk_data/corpora/ocrout_corpus” with a name that matches the first order name of the input file. The latest version of the backend code is here: https://github.com/sreal19/Tandem
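The screening step described above can be sketched roughly as follows. This is a minimal illustration rather than the code in the repository: the function names `choose_route` and `output_path_for` are hypothetical, the `.jpeg` and `.tif` spellings are an added assumption, and the actual Tesseract and PDFMiner calls are left out.

```python
from pathlib import Path

# File types named in the post; ".jpeg" and ".tif" are assumed alternate spellings.
OCR_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}
PDF_EXTENSIONS = {".pdf"}

# Output folder named in the post, treated here as a relative path.
OUTPUT_DIR = Path("nltk_data/corpora/ocrout_corpus")


def choose_route(input_path):
    """Return 'ocr', 'pdf', or None depending on the file extension."""
    suffix = Path(input_path).suffix.lower()
    if suffix in OCR_EXTENSIONS:
        return "ocr"  # would be handed to Tesseract
    if suffix in PDF_EXTENSIONS:
        return "pdf"  # would be handed to the PDFMiner routine
    return None  # unsupported type, skipped


def output_path_for(input_path):
    """Build the TXT path that shares the input file's base name."""
    return OUTPUT_DIR / (Path(input_path).stem + ".txt")
```

Keeping the routing separate from the extraction itself means the screening logic can be checked without Tesseract or PDFMiner installed.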

Web functionality remains problematic. Most effort this week has been merely trying to get through the Flask tutorial.

To end on a positive note, developmentally, good progress has been made with Text Analysis processing. We are computing the word count and average word length for a single page. The program also creates a complete list of words for each input file. In the very near future work will be completed to create a list of unique words and the count of each. The team must make a decision about whether to strip punctuation from the analysis, since many of the OCR errors are rendered as punctuation.
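To make the punctuation question concrete, here is a small sketch of the page-level statistics using only the Python standard library. It is not the project's NLTK code; the function name `page_stats` and the `strip_punctuation` flag are assumptions made for illustration.

```python
import string
from collections import Counter


def page_stats(text, strip_punctuation=True):
    """Compute word count, average word length, and unique-word counts."""
    words = text.split()
    if strip_punctuation:
        # Trim ASCII punctuation from both ends of each token and drop
        # tokens that were nothing but punctuation (a common OCR artifact).
        words = [w.strip(string.punctuation) for w in words]
        words = [w for w in words if w]
    words = [w.lower() for w in words]
    word_count = len(words)
    average = sum(len(w) for w in words) / word_count if word_count else 0.0
    return {
        "word_count": word_count,
        "average_word_length": average,
        "unique_words": Counter(words),
    }
```

Running it both ways on the same page shows how much stray punctuation inflates the counts, which may help the team decide.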

Design/UI/UX

We’ve been working to identify the ideal UX functionalities for javascript. Most of this was fairly straight-forward, such as giving the user the ability to browse local folders & view a progress bar of the upload/analysis. It has been difficult to locate a script to produce error messages. Searching for anything involving “error” in the name retrieves a different type of request, and “progress” only gets to half of the need.

For instance, we had discussed having the ability to let users identify upload/analysis errors by file, either with a prompt on the final screen or with indicator text in the CSV output. Such a feature will provide the user with the ability to go back and fix the error for 1 file, versus having to comb through the entire corpus and re-upload. An example of how this would look would be something like this, with text & visual cues that indicate which file needs review:

There is some documentation on Javascript progress events and errors, but we need to discuss how it could be employed for TANDEM, and whether it’s necessary for the 0.5 version.

Outreach

Twitter continues to be the primary platform for outreach. While #picturebookshare continues to chime away, we are also now using it to generate research ideas for potential TANDEM users. Fun distant futures for TANDEM might involve the visual trajectories of various aspects of books: visuality of covers or book spines, as well as the visual history of education materials.

Jojo spoke with Carrie Hintz, who is starting a Childhood Studies track via the English Department, to see if she knew anyone studying illustrated books at the GC. She has no leads yet, but said come the fall she would have a better idea of people interested in TANDEM. Meanwhile, Long LeKhac, an English PhD at Stanford, was giving her a sense of the DH scene there and said he would ask around the DH community beyond Moretti’s lab. Jojo is in the process of devising outreach to text studies experts (Kathleen Fitzpatrick at MLA, Steve Jones) and folks in journalism (Nick Diakopoulos, NICAR and Jonathan Stray), per Amanda Hickman’s suggestion. Keep on keeping on; keep the tweets t(w)eeming.

Digital HUAC- Project Update

This week, our team found the answer to our biggest development hurdle: DocumentCloud. Prior to this discovery, we were trying to figure out how to create a relational database that would store the meta tags of our corpus and respond to user input in our website’s search form.

It turns out that DocumentCloud, with an Open Calais backend, is able to create semantic metadata from document uploads and can pull the entities within the text. The ability to recognize entities (places, people, organizations) is particularly helpful for our project since these would be potential search categories. We are also able to create customized search categories through DocumentCloud by creating key-value pairs. On Tuesday, we uploaded our 5 HUAC testimonies and started to create key-value pairs based on our taxonomy. (Earlier this week, we finalized the taxonomy after receiving feedback from Professor Schrecker at Yeshiva University and Professor Cuordileone at CUNY City Tech.) In order to create these key-value pairs, we had to read through each transcript and pull our answers, like this:

Field	Notes & Examples	Rand	Brecht	Disney	Reagan	Seeger
Hearing Date	year-mo-day, 2015-03-10	1947-10-20	1947-10-30	1947-10-24	1947-10-23	1955-08-18
Congressional Session number		80th	80th	80th	80th	84th
Subject of Hearing		Hollywood	Hollywood	Hollywood	Hollywood	Hollywood
Hearing location	City, 2 letter state	Washington, DC	Washington, DC	Washington, DC	Washington, DC	New York, NY
Witness Name	Last Name, First Middle	Rand, Ayn	Brecht, Bertolt	Disney, Walt	Reagan, Ronald W.	Seeger, Pete
Witness Occupation	or profession	Author	Playwright	Producer	Actor	Musician
Witness Organizational Affiliation				Walt Disney Studios	Screen Actors Guild	People’s Songs
Type of Witness	Friendly or Unfriendly	Friendly	Unfriendly	Friendly	Friendly	Unfriendly
Result of appearance	contempt charge, blacklist, conviction		Blacklist			Contempt charge, but successfully appealed; Blacklist
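As a rough illustration, one column of the table could be held as key-value pairs like the dictionary below, with a quick check that the values follow the formats in the Notes & Examples column. The key names and the `invalid_fields` helper are hypothetical; they are not the keys we entered in DocumentCloud.

```python
from datetime import datetime

# The Seeger column of the table, with made-up key names.
seeger = {
    "hearing_date": "1955-08-18",
    "congressional_session": "84th",
    "subject_of_hearing": "Hollywood",
    "hearing_location": "New York, NY",
    "witness_name": "Seeger, Pete",
    "witness_occupation": "Musician",
    "witness_affiliation": "People’s Songs",
    "witness_type": "Unfriendly",
}


def invalid_fields(record):
    """Return the names of fields that break the taxonomy's formats."""
    problems = []
    try:
        # Hearing dates are written year-mo-day, e.g. 1947-10-20.
        datetime.strptime(record.get("hearing_date", ""), "%Y-%m-%d")
    except ValueError:
        problems.append("hearing_date")
    if record.get("witness_type") not in {"Friendly", "Unfriendly"}:
        problems.append("witness_type")
    return problems
```

A check like this is only a convenience for catching typos while hand-entering values from the transcripts.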

With DocumentCloud thrown back into the mix, we had to take a step back and start again with site schematics. We discussed each step of how the user would move through the site, down to the click, and how the backend would work to fulfill the user input in the search form. (Thanks, Amanda!) In terms of development, we will need to create a script (Python or PHP) that will allow the user’s input in the search box to “talk” to the DocumentCloud API and pull the appropriate data.
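A first pass at such a script might do little more than turn the form input into a search URL. The sketch below is in Python and rests on assumptions that should be checked against the DocumentCloud API documentation: the endpoint address, the `q` parameter, and the `key: "value"` filter syntax are all unverified here, and `build_search_url` is a hypothetical name.

```python
from urllib.parse import urlencode

# Assumed search endpoint; confirm against the DocumentCloud API docs.
SEARCH_ENDPOINT = "https://www.documentcloud.org/api/search.json"


def build_search_url(free_text="", filters=None, endpoint=SEARCH_ENDPOINT):
    """Combine free text and key-value filters into one query URL."""
    terms = []
    if free_text:
        terms.append(free_text)
    # Assumed filter syntax: key: "value". Sorted for a stable URL.
    for key, value in sorted((filters or {}).items()):
        terms.append('{}: "{}"'.format(key, value))
    # urlencode escapes spaces, colons, and quotes for us.
    return endpoint + "?" + urlencode({"q": " ".join(terms)})
```

For example, `build_search_url("communism", {"witness_type": "Friendly"})` yields a URL the site could request and then render as a results page.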

Amanda mentioned DocumentCloud to us a while ago, but our group thought it was more of a repository than a tool, so our plan was to investigate it later, after we figured out how to build a database. After hounding the Digital Fellows for the past couple of weeks on how to create a relational database, they finally told us, “You need to look at DocumentCloud.” Moral of the story: Question what you think you know.

On the design front, we started working in Bootstrap and have been experimenting with Github. We were able to push a test site through Github pages, but we still need to work on how to upload the rest of our site directory. This is our latest design of the site:

CUNYcast Update week 5

SHOUT IT OUT WITH CUNYcast!

CUNYcast is moving forward and expanding our knowledge of the technical requirements involved in online radio broadcast. This week major strides were taken in outreach and development.

Contact was made with support and specialty knowledge in online radio broadcast technology (Mikhail Gershovich)
Reclaim hosting server space was finalized
Icecast and Airtime were uploaded to server space.

THE MINIMUM VIABLE PRODUCT (MVP)

CUNYcast is a live online radio website offering students an opportunity to stream audio using original content from classes, lectures, and projects. CUNYCast’s aim is to empower a DH guerrilla broadcast community.

CUNYcast will reach out to the GC through an academic commons page that will link users, listeners, and curious DHers to our CUNYcast web presence. The CUNYcast web page will have a space for listeners to listen to the live streaming CUNYcast content. It will have a space where users may learn how to access the CUNYcast live stream and upload their own content. CUNYcast is designed to inform and inspire its users. To facilitate this experience, CUNYcast’s web page will house a manual that will empower users to add their own content to the CUNYcast live streaming radio and inform them on how they could create their very own digital live stream radio channel. A portion of the manual will help users learn how to create their own audio content if they wish to explore a more polished radio stream format.

As an added bonus, the CUNYcast website will have links to educational audio content and pedagogy surrounding teaching practices that utilize audio creation as a mode of production.

Technical specifications for MVP

The technical map of CUNYcast lies in the Icecast media server and the Airtime client used to manage media on the media server. This back end structure will be given its public face on our website and our CUNY Commons presence.

Outreach: Report of activities to date

How To Succeed Even When You Fail

Spring semester 2015. Our Digital Humanities class broke into teams. We were only mildly anxious. Like the television show, “Shark Tank” which features new pitches for products and services each week, we were convinced our ideas were sound and that we could excel. The thing was, within just a few days we started to drown. Instead of devouring the material and spitting it back out for human consumption, we started sinking in a sea of possibilities. No tech geeks on our team. Just dreamers. That didn’t stop us from grabbing at every idea that seemed to float.

But, wait, our group of four people diminished to only three by week two. Man down. He disappeared and dropped the class (we wished him well). The three of us had to take a good hard look at the CUNYcast concept and decide what would assure our chances of survival. (Think of the music to Jaws playing underneath these words).

We scrapped our overblown idea of an RSS-feed calendar, linked into the CUNY system, that would record remotely via an app. It took two afternoons of staring at code to realize that by the time the project was due, we’d maybe have gotten through a couple of introductory tutorials. There was no way any of us would be coding experts in 12 weeks.

We trimmed the fat. Bit back with strength and vigor, and began on the current instantiation of CUNYcast: a live online radio website offering students an opportunity to stream audio using original content from classes, lectures, and projects. Our professors urged us to aim outside the box and empower an entire DH guerrilla broadcast community at the Graduate Center. Reporting in on week 4 and things are going swimmingly. We’ve gelled as a team and we’re optimistic.

We are not afraid anymore – even if we should be.

Development:

This week, the goal was to configure an Icecast Media server in a local environment. Airtime and Icecast were configured on our server when we received the server configuration thanks to Reclaim Hosting.

Icecast is (again) a media server. When you have an online radio station, the media server is where the audio/video lives for the duration of the stream, sort of an intermediary between the streamer (host machine) and the watcher (listener). Airtime is sort of a GUI that gives a face to the media server. Not only does it make the media server friendlier, it also makes it prettier. Airtime comes with a calendar that allows shows to be planned in advance.

One interesting thing about media servers is that if someone has the access information to an Icecast server (ds106 allows theirs to be public, as will we, that’s kind of the point) they can use broadcasting programs to take over the station. If another person tries to take over the station when a show is going on, they’ll be met with an error. Airtime simplifies this with the above-mentioned calendar feature, as it allows users to see when shows are planned, and as such, schedule their planned broadcasts around that. Of course, this also allows for anarchy…

Bugs! The Icecast Server worked perfectly. We were able to access it via broadcasting software (Mixxx) and pick up that broadcast via VLC and browser (the address currently being cunycast.net:8000/live, kind of ugly). However, Airtime specifically had some trouble connecting to our Icecast server, even after multiple troubleshooting attempts. When transmitting via Airtime, a connection could be established to the Icecast server for roughly 10 seconds before falling flat, despite Airtime claiming the show was still airing. I hate it when machines lie to me. Anyway, after doing some GoogleFu I came across a thread on the SourceFabric forums (SourceFabric developed Airtime) about this exact problem. The fix stated in the thread claimed that I needed to restart certain Airtime services via the command line using the “sudo” command. Sounds scary. Because Airtime was installed for us, I was a little worried about messing it up, fearing that I would have to reinstall things that I do not understand. However, we were able to fix the bug more easily, by switching the broadcasting format from OGG Vorbis to the simpler MP3 format.

development goals include:

Figure out how to interact with Airtime via command line (need help from Digital Fellows here)
Bring the backend media server to the front ASAP such that we have a simpler/prettier way for users to tune in.
Implement an AutoDJ to play over the station and maintain it when no broadcasts are coming in (this is where we may need to talk to a ds106 person).
Determine how incoming users will be able to manipulate/interact with Airtime.

Design:

Slow progress is being made constructing the structure and elements of the CUNYcast web presence using Bootstrap. The pre-organized JavaScript and CSS allow for an immediate product, but there is still a bit to understand about adding and linking to media.

The CUNYcast Academic commons site is being designed to mirror the CUNYcast website.

The guide on how to create websites is being updated to make sure that the CUNYcast manual evolves as the project evolves.

TANDEM Project Update Week 5

Excitement!

Team TANDEM is working fast and furiously on all fronts. We’ve hit a few snags but all told, we feel like we’ve got a handhold on the mountains we’re climbing. Here’s a brief overview of the ups and downs of the week:

Our hope that we might springboard off Lev’s tool proved to be something of a castle in the air. Lev’s feature extractor was coded in a day. When they tried to run it again later, they couldn’t. Lev suggested we use OpenCV instead.
OpenCV seems to be a massive and constantly shifting morass of dependencies.
Jojo attended a couple of talks from Franco Moretti and spoke with him afterwards to see if anyone at Stanford was doing anything similar. While he acknowledged the validity of studying text as image, he showed no further interest. Bummer, but his loss.

The Details

In development we’ve got a working program for OCR and NLTK (Go Steve!), and we’re making strides in OpenCV (Go Chris!). Lev suggested that we have two different types of picture books for our test corpus — one that’s rich in color, another that’s more gray-scale with more text. These corpus variations will show the range of data values available to future users of TANDEM. Kelly’s working on scanning an initial test corpus now.

We also have our hosting set up with Reclaim thanks to Tim, as well as a forwarding email domain. Go ahead and send us an email to [email protected].

In design/UI/UX Kelly has been working on variations of a brand identity, for color schemes, logo, web design elements… all of it (Go Kelly!). The UI is primed to go now that we have hosting for TANDEM. Kelly is currently working on identifying the code for the specific UI elements desired, for an “ideal world” situation. The next steps for design/UI/UX are to pick a final brand image, and apply it to all our outreach initiatives.

In outreach, TANDEM had a good meeting with Lev on Wednesday. He seems to think we’re doing something other DHers aren’t quite doing. We’re not yet convinced that it’s not just because it’s crazy hard. Either way, we’re up for it. Otherwise, Jojo has been working it hard (Go Jojo!) on all outreach fronts. This week we received interest from Dr. Bill Gleason at Cotsen Children’s Library at Princeton, where they’re working on ABC book digitization and seem especially interested in our image analysis. This response is proof of relevance in the field.

In regards to social media, we now have a proper twitter handle, which we will admit happened in the middle of last week’s class thanks to some pressure from Digital HUAC already having one. You can follow us @dhTANDEM. More on Twitter: TANDEM had a couple really useful retweets (hurray Alex Gil, massively connected Columbia DHer!) that generated some traffic on our website (jetpack has us at 138 views so far, which is not a ton, but it’s a start!) and has won us some good DH followers, including @NYCDH and @trameproject. We’ve transferred #picturebookshare to the @dhTANDEM account, and we are inviting our followers to participate, as well as to use it as a means to suggest additional items for our test corpus.

THE MINIMUM VIABLE PRODUCT (MVP)

TANDEM_wireframe

MVP version #1

Because TANDEM is leveraging tools that already exist, one very basic minimum deliverable is that TANDEM makes OCR, NLTK, and OpenCV easy to use. Moreover, if TANDEM itself is not easy to use, there is no inherent advantage in using TANDEM over simply installing the existing tools and running them.

TANDEM as this minimum deliverable would solve the issue of having these tools in a web based environment, relieving users of the laborious headache of installing the component elements. Even after installing the component elements a user would likely have to write code to obtain the required output. TANDEM will shield the user from that need to be a programmer. At this minimum deliverable, TANDEM has not wrapped the three together into a single output.

MVP version #2

A second, more advanced minimum viable product would be to have a website on which a person could upload high resolution TIFF files, press a “run TANDEM” button, and receive a .CSV document containing the core (minimum) output.

The minimum output will consist of six NLTK values (average word length, word count, unique word count, word frequency (excluding stop words), bi-grams and tri-grams) and three image statistics for each input page provided by the user. We hope to expand the range of file types that we can support and to improve the quality of our OCR output, as well as build more elaborate modules for feature detection in both text and illustrations. However, we contend that demonstrating the comparative values of a couple corpora of picture books will prove that there is relevant information to be found across corpora with heavy image content.
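To give a feel for what one row of that .CSV might hold, here is a standard-library sketch of the text half of the output. It is an assumption-laden stand-in, not TANDEM's code: the column names, the tiny stop-word list, and the helpers `ngrams`, `page_row`, and `rows_to_csv` are all made up for illustration, NLTK is not used, and the three image statistics are omitted.

```python
import csv
import io
from collections import Counter

# Tiny illustrative stop-word list; NLTK ships a much fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to"}


def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def page_row(page_name, words):
    """Summarize one page's words as a flat dictionary (one CSV row)."""
    content = [w for w in words if w not in STOP_WORDS]
    top = Counter(content).most_common(1)
    average = round(sum(len(w) for w in words) / len(words), 2) if words else 0
    return {
        "page": page_name,
        "word_count": len(words),
        "unique_word_count": len(set(words)),
        "average_word_length": average,
        "top_word": top[0][0] if top else "",
        "bigram_count": len(ngrams(words, 2)),
        "trigram_count": len(ngrams(words, 3)),
    }


def rows_to_csv(rows):
    """Write the row dictionaries to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

One row per page keeps the file easy to open in a spreadsheet, at the cost of summarizing the n-grams as counts rather than listing them.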

MVP #3

We are shooting for a single featured MVP. A user comes to www.dhTANDEM.com and uploads a folder of image files following the printed instructions on the screen. They are prompted to hit an “analyze” button. After a few moments, a downloadable file is generated containing OCR-ed text, key data points from the OCR-ed text, and key feature descriptors from the overall image. This is purely for an early adopter looking to generate some useful data so that they can continue working on their story and/or data visualization.

Concerns

Overall things are going swimmingly. But of course there are concerns. This week’s concerns include:

Can we get the OpenCV to do what we need in the time we have available? This seems to be the element that people really want — the visuality of illustrated print.
Will we be able to scale the project to process the number of pages we would need for users to get the results that would prove TANDEM’s value?

These are, of course, huge questions. But to put it all in perspective: Stephen Zwiebel told Kelly this week that DH Box was held together “by tape” at the time of the final project presentations, and that it has had a lot more time in the past 9 months to become stable. Not to say that we aren’t looking to have a (minimally viable) product come May, but it’s a good feeling to know where other groups were last year. Should we be sharing that widely with the class? Well, we just did. 🙂

THANKS FOR FOLLOWING @dhTANDEM!

a bicycle built by four, for two (text AND image)

Digital HUAC – Workplan & Wireframe & Update

Wireframe:

Workplan:

Workplan: what & why

Workflow

The documents (which are already scanned) will be manually tagged using an XML editor according to identified categories, then read into an open-source relational database (MySQL), which reads XML documents. The MySQL database will be incorporated into the website using PHP in conjunction with the site (syntax—PHP within the HTML/CSS site schema). Finally, the API will allow users to export their searches to text-analysis resources.
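The tag-then-load step of this workflow can be pictured with a short Python sketch. Everything specific in it is an assumption: the XML element and attribute names are invented, Python's built-in `sqlite3` stands in for MySQL so the example runs anywhere, and the plan above calls for PHP rather than Python on the site itself.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A hand-tagged transcript header with invented tag names.
SAMPLE_XML = """
<hearing date="1947-10-20">
  <witness type="Friendly">Rand, Ayn</witness>
  <occupation>Author</occupation>
</hearing>
"""


def load_hearing(xml_text, connection):
    """Parse one tagged document and store its fields as a table row."""
    root = ET.fromstring(xml_text.strip())
    connection.execute(
        "CREATE TABLE IF NOT EXISTS hearings "
        "(hearing_date TEXT, witness TEXT, witness_type TEXT, occupation TEXT)"
    )
    witness = root.find("witness")
    connection.execute(
        "INSERT INTO hearings VALUES (?, ?, ?, ?)",
        (root.get("date"), witness.text, witness.get("type"),
         root.findtext("occupation")),
    )
    connection.commit()


def search_by_type(connection, witness_type):
    """Return witness names matching a category, as a search form might."""
    cursor = connection.execute(
        "SELECT witness FROM hearings WHERE witness_type = ?",
        (witness_type,),
    )
    return [row[0] for row in cursor]
```

The parameterized queries (the `?` placeholders) matter once user input from a search form reaches the database.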

Historians and Corpus

We’ve identified a number of historians, librarians and archivists, and digital humanists to potentially work with on this project and are in the process of reaching out to them in an advisory capacity. We seek guidance on our taxonomy and controlled vocabularies in the short term, and on future developments of our project beyond the scope of this semester.

At the top of this list are historians Blanche Cook and Josh Freeman, CUNY professors and experts on the HUAC era. Steve Brier is in the process of introducing us to both Cook and Freeman. Other historians include Ellen Schrecker (Yeshiva), Mary Nolan (NYU), Jonathan Zimmerman (NYU), and Victoria Phillips (Columbia), each with subject expertise and research experience on the time, events, and people central to Digital HUAC. We have also identified Peter Leonard, a DH librarian at Yale; David Gary, the American History subject specialist at Yale who holds a PhD in American History from CUNY; John Haynes, a historian who served as a specialist in 20th-century political history in the Manuscript Division of the Library of Congress; and Jim Armistead and Sam Rushay, archivists at the Truman Library, as potential advisors.
We have narrowed down the corpus of text that we’ll be working with to include 5 transcripts: Bertolt Brecht; Ronald Reagan; Ayn Rand; Pete Seeger; and Walt Disney. This list of major cultural figures spans the hearings themselves and features both friendly and hostile witnesses, offering users a varied look into the nuances of interrogation. It is our opinion that by focusing on a witness base of recognizable figures that is thematically organized, users may examine their testimony as individuals and in context with one another. This quality of the HUAC hearings cannot be overstated, and Digital HUAC seeks to draw attention to it through the overall user experience.

Digital Praxis Seminar Fall 2014 – Spring 2015