Editors’ Choice: Counting words in HathiTrust with Python and MPI

By Lisa Marie Rhody | Published: September 9, 2016

Lisa’s note: This post was featured as an Editors’ Choice on Digital Humanities Now. We won’t be talking about NLP projects or the HathiTrust for a couple of weeks, but if you’re interested in text analysis, this piece by David McClure walks through some existing projects step by step.

~~~~~~~~~~~~~

In recent months we’ve been working on a couple of projects here in the Lab that make use of the Extracted Features data set from HathiTrust. This is a fantastic resource, and I owe a huge debt of gratitude to everyone at HTRC for putting it together and maintaining it. The extracted features are essentially a set of very granular word counts, broken out for each physical page in the corpus and by the part-of-speech tags assigned by the OpenNLP parser. With just the per-page token counts, it is possible to do a really wide range of interesting things: tracking large-scale changes in word usage over time, looking at how cohorts of words do or don’t hang together at different points in history, and so on. It’s an interesting constraint: the macro (or at least meso) scale is more strictly enforced, since it’s harder to dip back down into a chunk of text that can actually be read in the regular sense.
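To make the idea of per-page, per-tag word counts concrete, here is a minimal sketch in Python. It is not the code from the original post, and it does not use MPI. The `pages` structure and the `volume_token_counts` helper are hypothetical, made up for illustration from the description above; the real Extracted Features files may be laid out differently.

```python
from collections import Counter

# Hypothetical toy "volume": one dict per physical page, mapping each token
# to its counts per part-of-speech tag. This layout is an assumption for
# illustration, not the actual Extracted Features schema.
pages = [
    {"whale": {"NN": 3}, "sea": {"NN": 2}, "sail": {"VB": 1, "NN": 1}},
    {"whale": {"NN": 1}, "ship": {"NN": 4}, "sail": {"VB": 2}},
]


def volume_token_counts(pages, pos=None):
    """Sum per-page counts into one Counter for the whole volume.

    If `pos` is given, only counts carrying that part-of-speech tag are kept.
    """
    totals = Counter()
    for page in pages:
        for token, pos_counts in page.items():
            for tag, count in pos_counts.items():
                if pos is None or tag == pos:
                    totals[token.lower()] += count
    return totals


totals = volume_token_counts(pages)           # all tags merged
nouns = volume_token_counts(pages, pos="NN")  # noun readings only

print(totals.most_common(3))
print(nouns["sail"], totals["sail"])
```

Counters like these can be added together, so volume-level totals could in principle be rolled up by year or by collection to track changes in word usage over time.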

Continue reading: here.
