Lisa’s note: This post was featured as an Editors’ Choice on Digital Humanities Now. We won’t be talking about NLP projects or the HathiTrust for a couple of weeks, but if you’re interested in text analysis, this piece by David McClure helps talk through some existing projects step-by-step.
In recent months we’ve been working on a couple of projects here in the Lab that make use of the Extracted Features data set from HathiTrust. This is a fantastic resource, and I owe a huge debt of gratitude to everyone at HTRC for putting it together and maintaining it. The extracted features are essentially a set of very granular word counts, broken out for each physical page in the corpus and by the part-of-speech tags assigned by the OpenNLP parser. With just the per-page token counts, it is possible to do a really wide range of interesting things: tracking large-scale changes in word usage over time, looking at how cohorts of words do or don’t hang together at different points in history, and so on. It’s an interesting constraint. The macro (or at least meso) scale is more strictly enforced, since it’s harder to dip back down into a chunk of text that can actually be read, in the ordinary sense of the word.
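To make "per-page word counts broken out by part-of-speech tag" a bit more concrete, here is a minimal sketch in Python. The `pages` structure below is a made-up toy, not the actual Extracted Features file format, and the function names are hypothetical; the point is only to show how counts keyed by token and tag can be rolled up to the page or volume level.

```python
from collections import Counter

# Toy stand-in for one volume (an assumption for illustration, not the
# real file layout): a list of pages, each mapping
# token -> {part-of-speech tag -> count on that page}.
pages = [
    {"whale": {"NN": 2}, "sail": {"NN": 1, "VB": 1}},
    {"whale": {"NN": 1}, "sea": {"NN": 3}},
]


def page_token_counts(page):
    """Collapse the POS tags on one page: token -> total count."""
    return Counter(
        {token: sum(pos_counts.values()) for token, pos_counts in page.items()}
    )


def volume_token_counts(pages):
    """Sum the per-page token counts across every page in the volume."""
    total = Counter()
    for page in pages:
        total.update(page_token_counts(page))
    return total


def volume_pos_counts(pages):
    """Sum counts per POS tag across the volume, ignoring the tokens."""
    total = Counter()
    for page in pages:
        for pos_counts in page.values():
            total.update(pos_counts)
    return total


totals = volume_token_counts(pages)
pos_totals = volume_pos_counts(pages)
```

Once the counts are in this shape, tracking a word over time is mostly a matter of grouping volumes by publication year and summing the same counters. What never comes back is the word order within a page, which is the constraint described above.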