Editors’ Choice: Counting words in HathiTrust with Python and MPI

Lisa’s note: This post was featured as an Editors’ Choice on Digital Humanities Now. We won’t be talking about NLP projects or the HathiTrust for a couple of weeks, but if you’re interested in text analysis, this piece by David McClure walks through some existing projects step by step.

~~~~~~~~~~~~~

In recent months we’ve been working on a couple of projects here in the Lab that are making use of the Extracted Features data set from HathiTrust. This is a fantastic resource, and I owe a huge debt of gratitude to everyone at HTRC for putting it together and maintaining it. The extracted features are essentially a set of very granular word counts, broken out for each physical page in the corpus and by part-of-speech tags assigned by the OpenNLP parser. With just the per-page token counts, it is possible to do a really wide range of interesting things – tracking large-scale changes in word usage over time, looking at how cohorts of words do or don’t hang together at different points in history, etc. It’s an interesting constraint – the macro (or at least meso) scale is more strictly enforced, since it’s harder to dip back down into a chunk of text that can actually be read, in the regular sense of the idea.
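
As a rough illustration of what per-page counts broken out by part of speech make possible, here is a minimal Python sketch that collapses them into volume-level word counts. The `volume` dictionary, its keys, and the `volume_token_counts` helper are all made up for this example; the real Extracted Features files have their own JSON schema, which should be checked against the HTRC documentation before adapting anything like this.

```python
from collections import Counter

# Hypothetical, hand-made volume that loosely mirrors the layout described
# above: one entry per physical page, each mapping token -> {POS tag -> count}.
# This is NOT the actual Extracted Features schema.
volume = {
    "year": 1850,
    "pages": [
        {"the": {"DT": 12}, "whale": {"NN": 3}, "sail": {"NN": 1, "VB": 2}},
        {"the": {"DT": 9}, "whale": {"NN": 1}, "sea": {"NN": 4}},
    ],
}


def volume_token_counts(volume):
    """Sum per-page, per-POS counts into a single token -> count mapping."""
    totals = Counter()
    for page in volume["pages"]:
        for token, pos_counts in page.items():
            # Collapse the part-of-speech breakdown into one count per token.
            totals[token] += sum(pos_counts.values())
    return totals


counts = volume_token_counts(volume)
print(counts.most_common(3))
```

Grouping such per-volume counters by publication year would be one way to start looking at changes in word usage over time, though at the scale of the full corpus the work would have to be spread across many processes.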

Continue reading: here.

