Let’s build a GUI that combines the power of Google’s Tesseract OCR and FeatureExtractor.
The idea is to build an environment (web-based or standalone) where you can take your text-overlaid object, toss it in, and have a save-ready output file to take away with you. Generate the data you need to visualize, explore, parse apart, and build the story. There is an interesting dialogue between text and images happening in comics, children’s picture books, marketing materials, illustrated maps, illuminated manuscripts, etc. Get your data, understand the output variables in simple and easy to reference ways, and get back to finding your story.
- Team Note: Be ready to learn. Everyone involved in this intimidating project will bleed through their role and engage in collaborative learning of each of the elements needed to complete the project. Developers must understand design, designers must understand branding, outreach/branding must understand how the thing works, and the project coordinator must understand how to get conversations rolling to hit deadlines.
- Developer: Bravery. Develop clean and clear code that will allow us to wrap our OCR and image processing software as modules to be placed in the overall software. Working understanding of code and willingness to dedicate time to digging into what needs to be written to get this off the ground. Knowledge of Python, or of at least one programming language.
- Designer: Understand user interaction and develop an aesthetically simple, intuitive interface. Understand design basics and have a working proficiency in Adobe design programs. Also maintain brand identity in conjunction with the Outreach Coordinator.
- Outreach Coordinator: Social butterflies. We need community support. Work on creating a voice and an audience for this project, not only by using social media but also by tracking where our message is working best: which tweets work, which outlets are giving good feedback. We need to build a communal conversation that helps us reach our goal.
- Project Manager: Keep your hand on the pulse of the schedule, set deadlines, gather learning resources, keep open lines of communication between team members. I have so many people in mind for this and each one can potentially bring an entirely different outcome to the project. I want a project manager who wants to see this thing materialize.
Google’s Tesseract OCR
Python-tesseract is a Python wrapper for Google’s Tesseract-OCR.
FeatureExtractor (Let’s talk to Lev about this. It is one of his tools, after all.)
PyJamas GUI Toolkit
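To give a feel for the OCR half of the pipeline, here is a minimal sketch of how Python-tesseract might turn an image into a save-ready CSV. It assumes the `pytesseract` and `Pillow` packages plus a local Tesseract install; the function names `ocr_words` and `rows_to_csv` and the choice of columns are made up for illustration, not a settled design for the tool.

```python
import csv
import io

# Columns we keep from Tesseract's word-level output (an illustrative choice).
FIELDS = ["text", "left", "top", "width", "height", "conf"]


def ocr_words(image_path):
    """Run Tesseract on an image and return one dict per recognized word."""
    # Imported here so the CSV helper below works without the OCR stack.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    rows = []
    for i, text in enumerate(data["text"]):
        if text.strip():  # skip the empty entries Tesseract emits for layout levels
            rows.append({field: data[field][i] for field in FIELDS})
    return rows


def rows_to_csv(rows):
    """Serialize word rows to CSV text, ready to save or offer as a download."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Something like `rows_to_csv(ocr_words("page.png"))` would then give the user a file of words with their positions on the page, which is the kind of output variable the GUI would need to explain in plain terms.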
So it seems like the only way one can get old data from Twitter is to pay for it. There is a site called Topsy that seemed promising at first, because they do let you get data from several years back. However, drawing any reliable conclusions from that data would be hard, because they only give you ten pages of results, and you can specify a date range but not a time range. For the periods of time I am interested in, there was so much tweeting going on that ten pages of results only covers about two hours (and there is no controlling which two hours they show you). Plus, I think they are showing “top tweets,” rather than a feed of tweets in the order they happened, so that is another factor limiting their usefulness. Not to mention Topsy doesn’t offer locational data. The people at Topsy support sent me a list of other vendors including Gnip and Datasift, which both cost money. I also looked at Keyhole, which looks great, but again, it costs money to get “historical data” from them.
Unless someone has any ideas about how to get historical data off Twitter without having to pay for it, I think I am going to shift my focus temporarily to working on the tweeting surrounding the election, which should be easier, because it just happened. I will need to learn how to use the Twitter API to do that, though, so if anyone knows how, I would much appreciate any pointers you could give me. Also, I will need to figure out what my focus would be here, particularly in light of Wendy Davis’s recent loss of the gubernatorial race. Maybe I could compare this year to past years? Perhaps there was more support for Wendy Davis than for other Democratic candidates?
I am still attached to my earlier idea about mapping the locational data, though, and thinking about transforming this into a final project proposal. I think there are probably some organizations in Texas that might consider funding something like this, though I am also interested in broadening the project out to have a wider appeal. Here are some ideas I had for expanding it. (I would welcome other ideas or suggestions):
- Doing something like HyperCities, where video and other media could be layered over the map of tweets.
- Charting the tweets of conservatives as well as liberals.
- Charting the change in tone of the tweets over multiple important moments such as the hearings and Wendy Davis’s filibuster.
- Doing a data visualization of the other hashtags that were used in relation to the ones I already know about.
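For the last idea, the data behind such a visualization could be a simple count of which other hashtags show up alongside the known ones. Here is a minimal Python sketch, assuming the tweets are already collected as plain text strings; the function name `co_occurring_hashtags` and the placeholder tags in the usage note are hypothetical, not real data.

```python
import re
from collections import Counter

# Matches a "#" followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#(\w+)")


def co_occurring_hashtags(tweets, known):
    """Count hashtags that appear in the same tweet as any known hashtag."""
    known = {tag.lower().lstrip("#") for tag in known}
    counts = Counter()
    for text in tweets:
        # Use a set so a tag repeated within one tweet is counted once.
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        if tags & known:
            counts.update(tags - known)
    return counts
```

Calling `co_occurring_hashtags(tweets, ["#knowntag"]).most_common(20)` would give a ranked list that could feed a bar chart or a network graph of related hashtags.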