Challenge number one is finding a dataset I can work with. I want to do something with text. There are plenty of books available digitally, but they always seem to be in formats, such as PDF, that require a fair amount of cleaning (which I would rather not do) before any of the analysis tools can consume them.
I did find a number of interesting datasets on the NYC Open Data website, and many of them are easy to download. I downloaded a small one that contains the average SAT scores for each high school, and I was able to bring the dataset into R. It’s such a simple dataset (basically just a school ID plus the scores for each section of the SAT) that there isn’t much analysis one can do (mean, median, standard deviation…?). I would like to combine it with other datasets that could enrich the data. For example, if I could get the locations of the schools, I could map the SAT data, or if I could get demographic data for each school, I could correlate SAT performance with various demographic variables.
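The enrichment step is just a join on a shared school identifier. I work in R, but here is a minimal Python sketch of the idea; the column names (`dbn`, `pct_free_lunch`, etc.) and the tiny inline samples are assumptions for illustration, not the real NYC Open Data schema:

```python
import csv
import io

# Hypothetical stand-ins for the two downloaded CSVs; the real files
# are larger and have different column names.
sat_csv = """dbn,reading,math,writing
01M450,400,410,390
01M509,380,395,370
"""

demo_csv = """dbn,borough,pct_free_lunch
01M450,Manhattan,62.5
01M509,Manhattan,71.0
"""

def join_on_dbn(sat_text, demo_text):
    """Join two CSVs on a shared school identifier column."""
    demo = {row["dbn"]: row for row in csv.DictReader(io.StringIO(demo_text))}
    merged = []
    for row in csv.DictReader(io.StringIO(sat_text)):
        extra = demo.get(row["dbn"], {})  # empty if no demographic match
        merged.append({**row, **extra})
    return merged

rows = join_on_dbn(sat_csv, demo_csv)
```

In R the same join is a one-liner with `merge()`, as long as both datasets share a clean key column.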
On a parallel track, I am working with Joy, who is very interested in motherhood and wants to explore datasets related to that subject. If she can find a compelling dataset, we will work together to analyze it.
So it turns out that Project Gutenberg has books in TXT format that you can download. It also appears that there is a website, www.zamzar.com, that will convert PDF files to text files for free. Question for further thought: if the PDF is a scanned image, will converting it to text get me anywhere? I doubt it, since the pages are just pictures of text, but the best way to find out is to give it a go.
I am going to download Great Expectations from Project Gutenberg and see if I can pull it into R and, perhaps, Gephi or AntConc, to see what can be done that might be interesting.
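Once the TXT file is downloaded, even a raw word-frequency count is a reasonable first experiment (it is roughly what AntConc's word list does). A minimal Python sketch, using the novel's actual opening lines as a stand-in for the full downloaded file:

```python
import re
from collections import Counter

# In practice you would read the downloaded Gutenberg file, e.g.:
#   text = open("great_expectations.txt", encoding="utf-8").read()
# Here a short excerpt keeps the sketch self-contained.
text = """My father's family name being Pirrip, and my Christian name
Philip, my infant tongue could make of both names nothing longer or
more explicit than Pip. So, I called myself Pip, and came to be
called Pip."""

# Lowercase and split into word tokens (apostrophes kept inside words).
words = re.findall(r"[a-z']+", text.lower())
freq = Counter(words)
print(freq.most_common(5))
```

The same counts come out of R with `table()` on a tokenized vector; the interesting work starts once stopwords are filtered out.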
Thanks for sharing your initial steps. I have been making tiny explorations with Project Gutenberg texts myself, and thanks to an incredibly helpful sit-down with Micki at Digital Fellows office hours yesterday, I am feeling a little more empowered to play with the text data (she fixed a Gephi/Java conflict on my computer and pointed me to topic modeling and AntConc). I look forward to hearing more about your next steps, especially as I make my own attempts.
The LYNDA.com site you (and Matt) turned us on to has some tutorials on converting PDF to text for mapping. I’m watching those today… Talk soon.
THANK YOU for bringing up Project Gutenberg. I went exploring the data they make available recently and found that the pre-collected data were all RDF “metadata dumps.” (Those are Gutenberg’s words, not mine.) My problem is then within my own limitations in manipulating data: how does one go about handling data in RDF? What do I open it with? Can I clean it up and convert it to CSV? Project Gutenberg was where I first experienced this hurdle, but I’m finding it more and more elsewhere.
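One answer: dedicated RDF tools exist (the `rdflib` library in Python is the usual choice), but since an RDF/XML dump is still XML, you can also extract a few fields with nothing but the standard library and write them to CSV. A hedged sketch, using a tiny made-up RDF/XML snippet with Dublin Core terms; the real Gutenberg catalog is far larger and its exact element names may differ:

```python
import csv
import io
import xml.etree.ElementTree as ET

# A minimal RDF/XML sample; the real metadata dump is structured
# along these lines but with many more elements per record.
rdf_xml = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="ebooks/1400">
    <dcterms:title>Great Expectations</dcterms:title>
    <dcterms:creator>Dickens, Charles</dcterms:creator>
  </rdf:Description>
</rdf:RDF>
"""

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
}

def rdf_to_csv(xml_text):
    """Pull title/creator pairs out of RDF/XML and return CSV text."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["title", "creator"])
    for desc in root.findall("rdf:Description", NS):
        title = desc.findtext("dcterms:title", default="", namespaces=NS)
        creator = desc.findtext("dcterms:creator", default="", namespaces=NS)
        writer.writerow([title, creator])
    return out.getvalue()

csv_out = rdf_to_csv(rdf_xml)
```

This flattens a graph format into a table, so it only works when you know in advance which fields you want; for anything more structural, `rdflib` (or a triple store) is the better route.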
If anyone out there in DH Praxis land has a good solution to this, I’d love to hear it. Otherwise, I found some other CSV data elsewhere.
Thank you for sharing your process and your resources. I have no idea what I am doing, but I am working toward some progress.
These first steps are always most difficult, and I really appreciate you sharing the process!
Looking forward to more,