Editors’ Choice: Counting words in HathiTrust with Python and MPI

Lisa’s note: This post was featured as an Editors’ Choice on Digital Humanities Now. We won’t be talking about NLP projects or the HathiTrust for a couple of weeks, but if you’re interested in text analysis, this piece by David McClure walks through some existing projects step by step.

~~~~~~~~~~~~~

In recent months we’ve been working on a couple of projects here in the Lab that make use of the Extracted Features data set from HathiTrust. This is a fantastic resource, and I owe a huge debt of gratitude to everyone at HTRC for putting it together and maintaining it. The extracted features are essentially a set of very granular word counts, broken out for each physical page in the corpus and by part-of-speech tags assigned by the OpenNLP parser. With just the per-page token counts, it is possible to do a wide range of interesting things – tracking large-scale changes in word usage over time, looking at how cohorts of words do or don’t hang together at different points in history, etc. It’s an interesting constraint – the macro (or at least meso) scale is more strictly enforced, since it’s harder to dip back down into a chunk of text that can actually be read, in the ordinary sense of the word.
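To make the shape of the data concrete, here is a minimal sketch of what working with per-page, per-POS token counts looks like. The nested dictionary below is a simplified stand-in for the real Extracted Features JSON schema (the actual files have more structure and metadata); the function names and layout are illustrative assumptions, not HTRC’s API.

```python
# Minimal sketch: collapsing per-page, per-POS token counts (in the
# spirit of the HTRC Extracted Features data) into corpus-level totals.
# The data layout here is a simplified, hypothetical stand-in for the
# real schema, which nests counts under page/body metadata.
from collections import Counter

# Hypothetical pages: each maps token -> {POS tag -> count on that page}.
pages = [
    {"whale": {"NN": 3}, "white": {"JJ": 1}},
    {"whale": {"NN": 2}, "sea": {"NN": 4}},
]

def corpus_counts(pages):
    """Sum per-page, per-POS counts into one token -> total-count table."""
    totals = Counter()
    for page in pages:
        for token, pos_counts in page.items():
            totals[token] += sum(pos_counts.values())
    return totals

print(corpus_counts(pages)["whale"])  # 5 (3 on page 1 + 2 on page 2)
```

Because the counts are independent per page, this kind of aggregation parallelizes naturally – which is exactly the property that makes an MPI-style scatter/reduce over the corpus attractive.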

Continue reading: here.


Source: https://dhpraxisfall16.commons.gc.cuny.edu/2016/09/09/editors-choice-counting-words-in-hathitrust-with-python-and-mpi/
