
Tandem Team Report Week 7

PROJECT:

With our corpus defined and development goals set, the team is taking a two-pronged approach to reaching the final project. While Chris and Steve focus on continuing to develop and code the working project, Kelly and Jojo have turned their attention to the work to be done with the corpus. Equally as important as building TANDEM is the ability to show a proof of concept and illustrate the value of the output TANDEM generates. While duties will still overlap (design work may arise for Kelly, outreach remains with Jojo, and Chris and Steve will weigh in on theoretical questions), our focus is now much more pointed on the particular pieces of achieving a functioning and valuable tool and methodology.

 

DEVELOPMENT:

A key milestone was reached this week when the coding of the text-processing backend was completed. It still needs to be thoroughly tested, which the team expects to finish by 3/31. The program must also be merged with the image-processing code; this integration step is targeted for completion on 3/24. The current version of the program can be found on GitHub. The repo contains a number of test data files as well as documentation. The core program is TandemText.py.
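To give a flavor of the kind of per-page statistics the text backend produces, here is a minimal sketch using plain Python as a stand-in for the NLTK calls in TandemText.py (the function name and exact field list are illustrative, not the actual code):

```python
from collections import Counter

def text_metrics(page_text):
    """Compute simple per-page text statistics (a plain-Python stand-in
    for the NLTK-based measures; names here are illustrative)."""
    words = [w.strip(".,;:!?\"'").lower() for w in page_text.split()]
    words = [w for w in words if w]
    counts = Counter(words)
    total = len(words)
    return {
        "word_count": total,
        "unique_word_count": len(counts),
        "avg_word_length": sum(len(w) for w in words) / total if total else 0,
        "top_words": counts.most_common(3),
    }

metrics = text_metrics("The cat sat on the mat. The mat was flat.")
print(metrics["word_count"])         # 10
print(metrics["unique_word_count"])  # 7
```

The real backend uses NLTK's tokenizers and stop-word lists rather than naive splitting, but the shape of the output is the same: one dictionary of metrics per page.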

The team decided to abandon the Flask web framework in favor of Django, primarily because there is much more local support (from the Digital Fellows) for Django. We were able to switch because we had not built a significant code base in Flask, and much of the work done on Flask should transfer well to Django. Optimistically, the team should be able to get a pilot “Hello World” application running under Django on the Reclaimhosting.com server (with help from Zach Davis and Tim Owens).

Finally, on the development front, the team needs to envision and plan for how we will persist data on the website. Will persistence even be seen as a valuable feature by users? If so, how will we store and secure the data? How will we handle requests to amend or edit an existing result set? These decisions are pending, likely to be addressed at the 3/24 class.

 

 

OUTREACH:

This week TANDEM has maintained its Twitter activity. Jojo is also working on reaching out to new communities while developing useful skills: she has taken on work at the Tow Center for Digital Journalism and is exploring possible applications of TANDEM there, and she was accepted to Django Girls next weekend and assigned her team. She looks forward to meeting a number of people across disciplines and fields.

 

Trouble in Paradise

The internet is a paradise of open-source code, shared information, and technology tutorials. Right?

But what about the obstacles we’ve all come up against? Whether the problem is a steeper learning curve than we expected, a tiny piece of missing information, or a techie dead end, these issues can result in lost sleep, brain overload, or simply “giving up.” Our team has been working diligently over the past several weeks on three major fronts:

  • Identifying exactly what our project is, and how to model it.
  • Grappling with the technology in all its various aspects: broadcast & web building.
  • Overcoming what appears to be inherent issues with the systems we’re working with.

We are presenting in class this week, so we’ll share many of the details of CUNYcast’s progress then. But for the weekly blog, we’ll divulge one particularly annoying problem.

When you contact us to start a cast, we provide a URL. Attempting to create a broadcast using the information we currently provide leads to one of two things:

www.CUNYcast.net leads to: [screenshot: CUNYcast_NA]

A search for CUNYcast.net leads to: [screenshot: CUNYcast_safe]

Either result will prompt a potential CUNYcast user to contact our developer, who will respond with kind encouragement to just ignore the warnings and go ahead (see the nice red arrow):

[image]

For potential casters this might be a sign to give up or walk away. The lack of a security certificate is a problem. We’ve made inquiries, but we just don’t know how to fix it yet. Until we do, inviting potential casters to get on board might be a mistake. We don’t want to turn them off before they begin, and ultimately many of our casters may not be techie types at all. The whole process has gotta be super easy. That’s why we’re working so hard: so it can be super smooth to Shout It Out with CUNYcast in the future.

NEVER GIVE UP!
@CUNYcast #CUNYcast #DHPraxis14

 

Fashion Index Weekly Update

Screenshot_2015-03-10-16-54-23-1

Currently, Instagram carries more than a million images tagged #sprezzatura. We need to start learning the Instagram API and MySQL for data mining. To sort out the technology, we visited the Digital Fellows at the Media Lab. MySQL is what we will use to store our data and manage the connection to the server. One of the Digital Fellows, Evan, said we need to set up a local copy of MySQL; he also suggested learning to work with the API in Python. I will look over the Python library as well.
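To see what that local database setup might look like, here is a rough sketch using Python’s built-in sqlite3 module as a local stand-in for MySQL (the table and column names are our guesses, not a finalized schema):

```python
import sqlite3

# sqlite3 stands in for MySQL here; the schema is purely illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        image_url TEXT,
        hashtag TEXT,
        neighborhood TEXT
    )
""")
cur.executemany(
    "INSERT INTO posts (image_url, hashtag, neighborhood) VALUES (?, ?, ?)",
    [
        ("http://example.com/a.jpg", "sprezzatura", "SoHo"),
        ("http://example.com/b.jpg", "sprezzatura", "Williamsburg"),
    ],
)
conn.commit()
cur.execute("SELECT COUNT(*) FROM posts WHERE hashtag = ?", ("sprezzatura",))
count = cur.fetchone()[0]
print(count)  # 2
```

Swapping in a real MySQL connection later should mostly be a matter of changing the connect call and driver, since the SQL itself stays close to standard.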

Tessa, our developer, approached Shelley Bernstein at the Brooklyn Museum, but we could not reach her. Even so, her tagging project was very useful to study.

Limitation: the Instagram API does not offer the geospatial information we need. At this stage, we should look for third-party applications for the mapping.

For coding, we are using TextWrangler, which serves as our text editor for program scripts. Renzo (outreach) is using FileZilla to move files to the server. Renzo and Minn (designer) have also started working on mapping: they installed R and the ggmap package, which interact with Google Maps, import map images, and provide longitude and latitude information.

 

 

Digital HUAC- Project Update

This week, our team found the answer to our biggest development hurdle: DocumentCloud. Prior to this discovery, we were trying to figure out how to create a relational database that would store meta tags for our corpus and respond to user input in our website’s search form.

It turns out that DocumentCloud, with an OpenCalais backend, is able to create semantic metadata from document uploads and can pull out the entities within the text. The ability to recognize entities (places, people, organizations) is particularly helpful for our project, since these would be potential search categories. We are also able to create customized search categories through DocumentCloud by creating key-value pairs. On Tuesday, we uploaded our 5 HUAC testimonies and started to create key-value pairs based on our taxonomy. (Earlier this week, we finalized that taxonomy after receiving feedback from Professor Schrecker at Yeshiva University and Professor Cuordileone at CUNY City Tech.) In order to create these key-value pairs, we had to read through each transcript and pull out our answers, like this:

| Field | Notes & Examples | Rand | Brecht | Disney | Reagan | Seeger |
|---|---|---|---|---|---|---|
| Hearing Date | year-mo-day, e.g. 2015-03-10 | 1947-10-20 | 1947-10-30 | 1947-10-24 | 1947-10-23 | 1955-08-18 |
| Congressional Session | number | 80th | 80th | 80th | 80th | 84th |
| Subject of Hearing | | Hollywood | Hollywood | Hollywood | Hollywood | Hollywood |
| Hearing Location | City, 2-letter state | Washington, DC | Washington, DC | Washington, DC | Washington, DC | New York, NY |
| Witness Name | Last Name, First Middle | Rand, Ayn | Brecht, Bertolt | Disney, Walt | Reagan, Ronald W. | Seeger, Pete |
| Witness Occupation | or profession | Author | Playwright | Producer | Actor | Musician |
| Witness Organizational Affiliation | | | | Walt Disney Studios | Screen Actors Guild | People’s Songs |
| Type of Witness | Friendly or Unfriendly | Friendly | Unfriendly | Friendly | Friendly | Unfriendly |
| Result of Appearance | contempt charge, blacklist, conviction | | Blacklist | | | Contempt charge, but successfully appealed; blacklist |

With DocumentCloud thrown back into the mix, we had to take a step back and start again with site schematics. We discussed each step of how the user would move through the site, down to the click, and how the backend would work to fulfill the user input in the search form. (Thanks, Amanda!) In terms of development, we will need to create a script (Python or PHP) that will allow the user’s input in the search box to “talk” to the DocumentCloud API and pull the appropriate data.
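The first piece of that script might be nothing more than turning the search form’s input into a query string. A sketch, with the caveat that the key:value search syntax and field names here are our assumptions based on the taxonomy, not a confirmed DocumentCloud API contract:

```python
def build_query(fields):
    """Turn a dict of taxonomy fields from the search form into a
    key:value query string (assumed DocumentCloud-style search syntax)."""
    parts = []
    for key, value in sorted(fields.items()):
        if " " in value:
            # Quote multi-word values so they are searched as a phrase.
            parts.append('%s:"%s"' % (key, value))
        else:
            parts.append("%s:%s" % (key, value))
    return " ".join(parts)

query = build_query({"witness": "Seeger, Pete", "session": "84th"})
print(query)  # session:84th witness:"Seeger, Pete"
```

The resulting string would then be handed to the DocumentCloud search API, and the matching documents (and their key-value data) returned to the user.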

websiteschematics

Amanda mentioned DocumentCloud to us a while ago, but our group thought it was more of a repository than a tool, so our plan was to investigate it later, after we figured out how to build a database. After hounding the Digital Fellows for the past couple of weeks on how to create a relational database, they finally told us, “You need to look at DocumentCloud.” Moral of the story: Question what you think you know.

On the design front, we started working in Bootstrap and have been experimenting with GitHub. We were able to push a test site through GitHub Pages, but we still need to work out how to upload the rest of our site directory. This is our latest design of the site:

dhuac

THE CUNYCAST COMMONS GROUP IS NOW OPEN TO ALL

Welcome to our world. The CUNYcast Commons Group is now open to all!
Shout it out. #CUNYcast

We started this project in earnest weeks ago. But looking back to March 1st, when we posted our second process report for the DH Praxis class (after we went from a four-person team to a three-person team), we have definitely made headway. (Bad shark. Go away.)

We are ready to start opening up the group to others who may be interested: (1) people signing up to produce future CUNYcasts, and (2) techie types who want to weigh in as we build out this platform (speak now or forever hold your peace). We are currently working in Bootstrap to configure our WordPress site and hope to launch by the week of April 1st.

Yesterday our site template went from this (see below) to the next slide. Julia and Joy worked in tandem (get it? ha.. ha; we luv you, classmates :)) on several agenda items:

Web_Page

To this:

[image]

Next we have to wrestle with fixing the code for the header on the new pages. Go to our CUNYcast Commons Group to view this daunting (and hysterical) code.
[or you can download FULL DOCUMENT w/HTML CODE HERE]

James configured widgets and generally dove into Icecast and Airtime.

Widgets

We have a custom icon now too:

CUNY_cast_Logo_2

We got a little blog action goin’ on by way of a shout out here too.

Next up, we gotta get the website uploaded to the server, make some adjustments to the pages, and iron out the flow between Airtime and our site. These are our major goals right now.

Fashion Index Weekly Update

Overall, our team is focusing on updating the webpage and learning new technologies, including coding and programming. WordPress doesn’t fit our project.

NYC_Fashion_Index_Prototype_index

  • Currently, our core goal is to generate HTML. Our team is working on the HTML coding and CSS.
  • Our team chose to use Bootstrap, including the Bootstrap Jumbotron.
  • We are learning the grid system (we need to measure accurate ratios).
  • Learned how to link social media icons (right-click the image and save, then reference the file name in an <img> tag in the HTML), e.g.: <img src=”icons/tumblr.png” alt=”tumblr icon” />
  • Setting custom styles for fonts and background colors.
  • Screen Shot 2015-02-24 at 5.06.52 AM
  • Trying to make our social media (Instagram and Twitter) look more active.
  • Followed several accounts related to the fashion industry and sprezzatura.
  • Screenshot_2015-03-14-19-01-01-1
  • Followed people who are currently attending, or have graduated from, Parsons, Pratt, and FIT. Fortunately, our follower count increased from 2 to 15.
  • One of the best examples is the tagging project done at the Brooklyn Museum, which invited users to add tags below images.

    Style

  • The Instagram account is process-based. We are planning to share our progress through pictures.
  • The website will be content-based, concentrating on documenting image archives.

 

CUNYcast Update week 5

SHOUT IT OUT WITH CUNYcast!

CUNYcast is moving forward and expanding our knowledge of the technical requirements involved in online radio broadcast. This week major strides were taken in outreach and development.

  • Contact was made with a source of support and specialty knowledge in online radio broadcast technology (Mikhail Gershovich)
  • Reclaim Hosting server space was finalized
  • Icecast and Airtime were uploaded to the server space.

THE MINIMUM VIABLE PRODUCT (MVP)

CUNYcast is a live online radio website offering students an opportunity to stream audio using original content from classes, lectures, and projects. CUNYcast’s aim is to empower a DH guerrilla broadcast community.

CUNYcast will reach out to the GC through an Academic Commons page that will link users, listeners, and curious DHers to the CUNYcast web presence. The CUNYcast web page will have a space for listeners to hear the live streaming CUNYcast content, and a space where users may learn how to access the live stream and upload their own content. CUNYcast is designed to inform and inspire its users; to facilitate this, the web page will house a manual that empowers users to add their own content to the CUNYcast live streaming radio and shows them how to create their very own digital live-stream radio channel. A portion of the manual will help users learn how to create their own audio content if they wish to explore a more polished radio stream format.

As an added bonus, the CUNYcast website will have links to educational audio content and pedagogy surrounding teaching practices that use audio creation as a mode of production.

Technical specifications for MVP

The technical map of CUNYcast lies in the Icecast media server and the Airtime client used to manage media on that server. This back-end structure will be given its public face on our website and our CUNY Commons presence.

projectmap03-02-2015

Outreach: Report of activities to date

How To Succeed Even When You Fail

Spring semester 2015. Our Digital Humanities class broke into teams. We were only mildly anxious. Like the television show, “Shark Tank” which features new pitches for products and services each week, we were convinced our ideas were sound and that we could excel. The thing was, within just a few days we started to drown. Instead of devouring the material and spitting it back out for human consumption, we started sinking in a sea of possibilities. No tech geeks on our team. Just dreamers. That didn’t stop us from grabbing at every idea that seemed to float.

But, wait, our group of four people diminished to only three by week two. Man down. He disappeared and dropped the class (we wished him well). The three of us had to take a good hard look at the CUNYcast concept and decide what would assure our chances of survival. (Think of the music to Jaws playing underneath these words).

We scrapped our overblown idea of an RSS-feed calendar linked into the CUNY system that would record remotely via an app. After two afternoons of staring at code, we realized that by the time the project was due, we’d maybe have gotten through a couple of introductory tutorials. There was no way any of us would be coding experts in 12 weeks.

We trimmed the fat. Bit back with strength and vigor, and began on the current instantiation of CUNYcast: a live online radio website offering students an opportunity to stream audio using original content from classes, lectures, and projects. Our professors urged us to aim outside the box and empower an entire DH guerrilla broadcast community at the Graduate Center. Reporting in on week 4 and things are going swimmingly. We’ve gelled as a team and we’re optimistic.

We are not afraid anymore, even if we should be.

Development:

This week, the goal was to configure an Icecast media server in a local environment. As it happened, Airtime and Icecast were already configured on our server when we received the server configuration, thanks to Reclaim Hosting.

Icecast is (again) a media server. When you run an online radio station, the media server is where the audio/video lives for the duration of the stream, acting as an intermediary between the streamer (host machine) and the watcher (listener). Airtime is, roughly, a GUI that gives a face to the media server. Not only does it make the media server friendlier, it also makes it prettier. Airtime comes with a calendar that allows shows to be planned in advance.

One interesting thing about media servers: if someone has the access information for an Icecast server (ds106 allows theirs to be public, as will we; that’s kind of the point), they can use broadcasting programs to take over the station. If another person tries to take over the station while a show is going on, they’ll be met with an error. Airtime simplifies this with the above-mentioned calendar feature, which lets users see when shows are planned and schedule their broadcasts around them. Of course, this also allows for anarchy…
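For reference, that takeover behavior hinges on the shared source password in Icecast’s configuration. A minimal, illustrative fragment of an icecast.xml (the passwords and mount name here are placeholders, not our actual settings):

```xml
<icecast>
  <authentication>
    <!-- anyone who knows this password can source (broadcast to) the station -->
    <source-password>CHANGE_ME</source-password>
    <admin-user>admin</admin-user>
    <admin-password>CHANGE_ME_TOO</admin-password>
  </authentication>
  <listen-socket>
    <port>8000</port>
  </listen-socket>
  <mount>
    <mount-name>/live</mount-name>
  </mount>
</icecast>
```

Making the station public, as ds106 does, simply means sharing that source password along with the server address and mount point.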

Bugs! The Icecast server worked perfectly. We were able to access it via broadcasting software (Mixxx) and pick up the broadcast via VLC and a browser (the address currently being cunycast.net:8000/live, kind of ugly). However, Airtime had some trouble connecting to our Icecast server, even after multiple troubleshooting attempts. When transmitting via Airtime, a connection could be established to the Icecast server for roughly 10 seconds before falling flat, despite Airtime claiming the show was still airing. I hate it when machines lie to me. Anyway, after some Google-fu I came across a thread on the SourceFabric forums (SourceFabric developed Airtime) about this exact problem. The fix stated in the thread was to restart certain Airtime services via the command line using sudo. Sounds scary. Because Airtime was installed for us, I was a little worried about messing it up, fearing that I would have to reinstall things I do not understand. In the end, we were able to fix the bug more easily by switching the broadcast format from Ogg Vorbis to the simpler MP3 format.

Development goals include:

  • Figure out how to interact with Airtime via command line (need help from Digital Fellows here)
  • Bring the backend media server to the front ASAP such that we have a simpler/prettier way for users to tune in.
  • Implement an AutoDJ to play over the station and maintain it when no broadcasts are coming in (this is where we may need to talk to a ds106 person).
  • Determine how incoming users will be able to manipulate/interact with Airtime.

Design:

Slow progress is being made constructing the structure and elements of the CUNYcast web presence using Bootstrap. The preorganized JavaScript and CSS allow for an immediate product, but there is still a bit to understand about adding and linking to media.

The CUNYcast Academic Commons site is being designed to mirror the CUNYcast website.

CUNYcast

The guide on how to create websites is being updated to make sure that the CUNYcast manual evolves as the project evolves.

Digital HUAC: MVP Post

Over the course of this project so far, and in relation to the feedback that we’ve been receiving, we have scaled up and down our goals and expectations. It has been both humbling and empowering to consider everything we can do within the constraints of a single semester project. When asked to brainstorm our minimum viable product (MVP) this week, over a conference call we all agreed on the following:

– a central repository with basic search functionality that stores our corpus of 5 transcripts.

– a database that can be scaled.

What does this mean, and how does it differ from our current project goals?

We are attempting to generate a platform that connects a relational database to a robust search interface and utilizes an API to allow users to extract data. We envision Digital HUAC to be the start of a broader effort to organize HUAC transcripts and allow researchers and educators access to their every character. By allowing advanced searches driven by keywords and categories, we seek to allow users to drill down into the text of the transcripts.

Our MVP focuses on storing the transcripts in a digital environment that returns simple search results: in absence of a robust search mechanism, users would instead receive results indicating, for example, that a sought after term appeared in a given transcript and not much more.

Our MVP must lay out a model for a scalable database. We are still very much figuring out exactly how our database will operate, so it is hard to fully commit to what even a pared-down version of this would look like. But we know that the MVP version must work with plain text files as the input and searchable files as the output.

Generating an MVP has been a useful thought experiment. It has forced us to home in on the twin narrative and technical theses of this project: essentially, if everything else were stripped away, what must be left standing. For us, this means providing basic search results and a working model of a relational database that, given appropriate time and resources, could be expanded to accommodate a much greater corpus.

TANDEM Project Update Week 5

TANDEM_big logo

Excitement!

Team TANDEM is working fast and furiously on all fronts. We’ve hit a few snags but all told, we feel like we’ve got a handhold on the mountains we’re climbing. Here’s a brief overview of the ups and downs of the week:

  • Our hope that we might springboard off Lev’s tool proved to be something of a castle in the air. Lev’s feature extractor was coded in a day, and when his team tried to run it again later, they couldn’t. Lev suggested we use OpenCV instead.
  • OpenCV seems to be a massive and constantly shifting morass of dependencies.
  • Jojo attended a couple of talks by Franco Moretti and spoke with him afterwards to see if anyone at Stanford was doing anything similar. While he acknowledged the validity of studying text as image, he showed no further interest. Bummer, but his loss.

The Details

In development we’ve got a working program for OCR and NLTK (Go Steve!), and we’re making strides in OpenCV (Go Chris!). Lev suggested that we have two different types of picture books for our test corpus — one that’s rich in color, another that’s more gray-scale with more text. These corpus variations will show the range of data values available to future users of TANDEM. Kelly’s working on scanning an initial test corpus now.

BookScanCenter_8

We also have our hosting set up with Reclaim thanks to Tim, as well as a forwarding email domain. Go ahead and send us an email to dhtandem@gmail.com.

In design/UI/UX, Kelly has been working on variations of a brand identity: color schemes, logo, web design elements… all of it (Go Kelly!). The UI is primed to go now that we have hosting for TANDEM. Kelly is currently working on identifying the code for the specific UI elements desired, for an “ideal world” situation. The next steps for design/UI/UX are to pick a final brand image and apply it to all our outreach initiatives.

TANDEM_Circle Glyphs

In outreach, TANDEM had a good meeting with Lev on Wednesday. He seems to think we’re doing something other DHers aren’t quite doing. We’re not yet convinced that it’s not just because it’s crazy hard. Either way, we’re up for it. Otherwise, Jojo has been working it hard (Go Jojo!) on all outreach fronts. This week we received interest from Dr. Bill Gleason at Cotsen Children’s Library at Princeton, where they’re working on ABC book digitization and seem especially interested in our image analysis. This response is proof of relevance in the field.

As for social media, we now have a proper Twitter handle, which we will admit happened in the middle of last week’s class, thanks to some pressure from Digital HUAC already having one. You can follow us @dhTANDEM. More on Twitter: TANDEM had a couple of really useful retweets (hurray Alex Gil, massively connected Columbia DHer!) that generated some traffic on our website (Jetpack has us at 138 views so far, which is not a ton, but it’s a start!) and won us some good DH followers: @NYCDH, @trameproject. We’ve transferred #picturebookshare to the @dhTANDEM account and are inviting our followers to participate, as well as using it to suggest additional items for our test corpus.

THE MINIMUM VIABLE PRODUCT (MVP)

TANDEM_wireframe

MVP version #1

Because TANDEM is leveraging tools that already exist, one very basic minimum deliverable is that TANDEM makes OCR, NLTK, and OpenCV easy to use. If TANDEM itself is not easy to use, there is no inherent advantage to using TANDEM over simply installing the existing tools and running them.

TANDEM as this minimum deliverable would solve the issue of getting these tools into a web-based environment, relieving users of the laborious headache of installing the component elements. Even after installing the components, a user would likely have to write code to obtain the required output; TANDEM shields the user from that need to be a programmer. At this minimum deliverable, TANDEM has not yet wrapped the three together into a single output.

MVP version #2

A second, more advanced minimum viable product would be to have a website on which a person could upload high resolution TIFF files, press a “run TANDEM” button, and receive a .CSV document containing the core (minimum) output.

The minimum output will consist of six NLTK values (average word length, word count, unique word count, word frequency excluding stop words, bi-grams, and tri-grams) and three image statistics for each input page provided by the user. We hope to expand the range of file types we can support and to improve the quality of our OCR output, as well as build more elaborate modules for feature detection in both text and illustrations. However, we contend that demonstrating the comparative values of a couple of corpora of picture books will prove that there is relevant information to be found across corpora with heavy image content.
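Assembling that minimum output needs nothing fancier than Python’s csv module. A sketch with made-up column names and sample values (the real field list and exact image statistics are still being finalized):

```python
import csv
import io

# Illustrative per-page columns: text values plus three image statistics.
FIELDS = [
    "page", "avg_word_length", "word_count", "unique_word_count",
    "top_words", "bigrams", "trigrams",
    "mean_brightness", "mean_saturation", "edge_density",
]

def rows_to_csv(rows):
    """Serialize per-page metric dicts into the CSV the user downloads."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

sample = [{
    "page": 1, "avg_word_length": 3.8, "word_count": 42,
    "unique_word_count": 30, "top_words": "cat;mat;hat",
    "bigrams": "the cat;cat sat", "trigrams": "the cat sat",
    "mean_brightness": 181.2, "mean_saturation": 0.42, "edge_density": 0.07,
}]
output = rows_to_csv(sample)
print(output.splitlines()[0])  # the header row
```

One row per input page, one column per metric: simple enough for an early adopter to open directly in a spreadsheet or feed into a visualization tool.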

MVP #3

We are shooting for a single-featured MVP. A user comes to www.dhTANDEM.com and uploads a folder of image files following the printed instructions on the screen. They are prompted to hit an “analyze” button. After a few moments, a downloadable file is generated containing OCR-ed text, key data points from the OCR-ed text, and key feature descriptors from the overall image. This is purely for an early adopter looking to generate some useful data so that they can continue working on their story and/or data visualization.

Concerns

Overall, things are going swimmingly. But of course there are concerns. This week’s concerns include:

  • Can we get the OpenCV to do what we need in the time we have available? This seems to be the element that people really want — the visuality of illustrated print.
  • Will we be able to scale the project to process the number of pages users would need in order to get results that prove TANDEM’s value?

These are, of course, huge questions. But to put it all in perspective: Stephen Zwiebel told Kelly this week that DH Box was held together “by tape” at the time of the final project presentations, and that it has had a lot more time in the past 9 months to become stable. Not to say that we aren’t looking to have a (minimally viable) product come May, but it’s a good feeling to know where other groups were last year. Should we be sharing that widely with the class? Well, we just did. 🙂

THANKS FOR FOLLOWING @dhTANDEM!

a bicycle built by four, for two (text AND image)

 

NYC Fashion Index Weekly Update

Screenshot_2015-03-03-13-14-32

 

 

As previously mentioned, we chose Python to work with data sets from the Instagram API.

It will be necessary to utilize MySQL along with geospatial libraries. For data mining, we are focusing on collecting geospatial tags in NYC areas to analyze fashion. A relational database fits our project well, since we plan to incorporate multiple data sets: hashtags, images, links between images and hashtags, and locational information (where the picture was taken, by neighborhood).
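Once those data sets live in a relational database, relating them is a matter of joins. A sketch, again using Python’s sqlite3 as a stand-in for MySQL, with an entirely illustrative schema (separate tables for images, hashtags, and locations, linked by image id):

```python
import sqlite3

# Illustrative three-table schema; sqlite3 stands in for MySQL.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE images (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE tags (image_id INTEGER, hashtag TEXT);
    CREATE TABLE locations (image_id INTEGER, neighborhood TEXT);
    INSERT INTO images VALUES (1, 'a.jpg'), (2, 'b.jpg');
    INSERT INTO tags VALUES (1, 'sprezzatura'), (2, 'sprezzatura');
    INSERT INTO locations VALUES (1, 'SoHo'), (2, 'SoHo');
""")
# Count tagged images per neighborhood: the kind of cross-data-set
# pattern the project wants to surface.
cur.execute("""
    SELECT l.neighborhood, COUNT(*) AS n
    FROM tags t JOIN locations l ON t.image_id = l.image_id
    WHERE t.hashtag = 'sprezzatura'
    GROUP BY l.neighborhood
""")
rows = cur.fetchall()
print(rows)  # [('SoHo', 2)]
```

Queries like this one (hashtag counts grouped by neighborhood) are what would feed the analytic, pattern-finding side of the project.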

Ren and Tessa met Prof. Gold to get feedback on our project. First of all, our project has to emphasize the “analytic research angle”: it would be better to observe patterns across multiple data sets. We also came to see that crowdsourcing can be unpredictable and unreliable, so we would rather lean toward curated approaches for our project.