Challenge number one is finding a dataset I can work with. I want to do something with text. There are plenty of books available digitally, but they always seem to be in formats, such as PDF, that require a fair amount of cleaning (which I would rather not do) before any of the analysis tools can read them.
I did find a number of interesting datasets on the NYC Open Data website. Many of them are easy to download. I downloaded a small one that contains the average SAT scores for each high school, and I was able to bring it into R, too. It’s such a simple dataset (basically just a school ID plus the score for each section of the SAT) that there isn’t much analysis one can do beyond the mean, median, and standard deviation. I would like to combine it with other datasets that could enrich the data. For example, if I could get the locations of the schools, I could map the SAT data, and if I could get demographic data for each school, I could correlate SAT performance with various demographic variables.
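To make the "combine and correlate" idea concrete, here is a minimal sketch of joining two tables on a shared school ID and then computing summary statistics and a Pearson correlation. It is written in Python using only the standard library rather than R, and everything in it is an assumption for illustration: the column names (`school_id`, `math`, `pct_free_lunch`), the helper functions, and the three rows of numbers are all made up and are not values from the NYC Open Data files.

```python
import csv
import io
import statistics

# Made-up rows for illustration only; NOT real NYC Open Data values.
SAT_CSV = """school_id,math,reading,writing
A01,400,410,390
A02,500,480,470
A03,450,455,440
"""

# Hypothetical demographic table sharing the same id column.
DEMO_CSV = """school_id,pct_free_lunch
A01,80
A02,40
A03,60
"""


def read_by_id(text, key="school_id"):
    """Parse CSV text into a dict of rows keyed by the id column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def join_on_id(left, right):
    """Inner join: keep only the ids that appear in both tables."""
    return [{**left[k], **right[k]} for k in left if k in right]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


rows = join_on_id(read_by_id(SAT_CSV), read_by_id(DEMO_CSV))
math_scores = [float(r["math"]) for r in rows]
lunch = [float(r["pct_free_lunch"]) for r in rows]

print("mean:", statistics.mean(math_scores))
print("median:", statistics.median(math_scores))
print("stdev:", statistics.stdev(math_scores))
print("correlation:", pearson(math_scores, lunch))
```

With real files, the two inline strings would be replaced by the contents of the downloaded CSVs, and the join would only work if both datasets identify schools with the same ID scheme, which I have not yet checked.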
On a parallel track, I am working with Joy, who is very interested in motherhood and wants to explore datasets related to that subject. If she can find a compelling dataset, we will work together to analyze it.
So it turns out that Project Gutenberg has books in TXT format that you can download. It also appears that there is a website, www.zamzar.com, that will convert PDF files to text files for free. Question for further thought: if the PDF is just a scanned image of the pages, will converting it to text get me anywhere? I doubt it. The best way to find out is to give it a go.
I am going to download Great Expectations from Project Gutenberg and see if I can pull it into R and, perhaps, Gephi or AntConc to see what interesting things can be done with it.
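As a first idea of what "pulling it in" might look like, here is a minimal sketch that trims the distributor's header and license text from a plain-text book and counts word frequencies. It is in Python rather than R, and it rests on some assumptions: that the file marks the book's body with lines beginning `*** START OF` and `*** END OF` (which should be checked against the actual download), that a crude regular expression is good enough for splitting words, and that the sample text below, which is made up, stands in for the real novel. The function names are hypothetical.

```python
import re
from collections import Counter

# Made-up stand-in for a downloaded book; not text from the actual novel.
SAMPLE = """Header text added by the distributor.
*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***
The marsh was cold. The wind was cold too.
*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***
License text added by the distributor.
"""


def strip_boilerplate(text):
    """Keep only the lines between the assumed start and end markers.

    If a marker is missing, fall back to the start or end of the file.
    """
    lines = text.splitlines()
    start = next(
        (i for i, line in enumerate(lines) if line.startswith("*** START OF")), -1
    )
    end = next(
        (i for i, line in enumerate(lines) if line.startswith("*** END OF")),
        len(lines),
    )
    return "\n".join(lines[start + 1:end])


def word_counts(text):
    """Lowercase the text and count runs of letters and apostrophes."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


body = strip_boilerplate(SAMPLE)
counts = word_counts(body)
print(counts.most_common(3))
```

For the real book, `SAMPLE` would be replaced with the contents of the downloaded file, for example `open("great_expectations.txt", encoding="utf-8").read()`, where the filename is only a placeholder.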