At The Lexicon of DH workshop run by the CUNY Digital Fellows on September 29, our class’s very own Jojo Karlin led an informative and engaging discussion of what it means to do (and how to do) digital humanities work. In the course of the workshop, a participant posed a question that was very interesting, to this librarian at least: Is there a database for all the data that is available? Perhaps if there were, I’d be out of a job!
The same participant asked, for argument’s sake, where they could go for information on US theater productions in the last century. I suggested the Internet Theater Database and the Internet Broadway Database, but both are limited in scope, and they were resources I was already familiar with. The inquirer was asking where to go if they did not already know of a data set. So, I quickly tried to find such a database of data sets during the workshop. Not surprisingly, I have not been able to find this comprehensive database of all data (at least not yet). Of course, there are several reasons why this database does not exist. Not least among them are the price of information and the commercialization of data, which often limit how information is shared, as well as the cost of compiling, hosting, and designing such a database.
Nonetheless, as in much of the DH community, arguments for sharing data are strongly rooted in the idea that openness creates opportunities for growth and development. So, I wanted to share some of the projects and resources I have found that are working towards an amalgamation of the open data sets out there…
Conveniently enough, last week I was doing some collection development for the library I work at and came across a recommendation for Data USA in the publication Choice. According to the “About” page, this project aims to place “public US Government data in your hands. Instead of searching through multiple data sources that are often incomplete and difficult to access…” As with much of the data currently available through this kind of portal, it is largely produced by government agencies.
During one of our initial class discussions, we went around and stated what projects or tools interested us or what we wanted to know more about. I said that I was interested in the DPLA (Digital Public Library of America). They do a great job of aggregating data from a range of cultural institutions throughout the country, and they make it pretty easy to access the data: https://dp.la/info/developers/
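To give a sense of what “pretty easy” can look like, here is a minimal sketch in Python of how a search request to the DPLA API might be put together. It only builds the request URL and does not contact the DPLA. The endpoint and the `q`, `page_size`, and `api_key` parameters are my assumptions about the API, so check them against the developer page linked above. The function name `build_dpla_query` and the placeholder key are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint for item searches; verify against the DPLA developer docs.
DPLA_ITEMS_ENDPOINT = "https://api.dp.la/v2/items"


def build_dpla_query(search_terms: str, api_key: str, page_size: int = 10) -> str:
    """Build a search URL for DPLA items (hypothetical helper, no network call)."""
    params = {
        "q": search_terms,       # assumed keyword-search parameter
        "page_size": page_size,  # assumed results-per-page parameter
        "api_key": api_key,      # a key must be requested from the DPLA
    }
    # urlencode escapes spaces and special characters for use in a URL.
    return f"{DPLA_ITEMS_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # "YOUR_API_KEY" is a placeholder, not a working key.
    print(build_dpla_query("theater productions", "YOUR_API_KEY"))
```

With a real key, the resulting URL could be pasted into a browser or fetched by a script to see what the aggregated records look like for a search such as the workshop participant’s theater question.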
Where would the state of information be if Google didn’t have some handle on aggregating data???
Open Knowledge International has put together http://dataportals.org/
Their mission statement is very noble indeed: “We want to see enlightened societies around the world, where everyone has access to key information and the ability to use it to understand and shape their lives; where powerful institutions are comprehensible and accountable; and where vital research information that can help us tackle challenges such as poverty and climate change is available to all.”
And a few other sources I’ve found (and there are so many more!):
- http://about.jstor.org/service/data-for-research
- https://www.icpsr.umich.edu/icpsrweb/
- http://data.un.org/
- http://gdeltproject.org/about.html
- http://wiki.dbpedia.org/
- http://www.pewinternet.org/datasets/
- http://www.datacenter.org/research-tools/web-resources/
Forbes.com published a helpful list of 33 free big data sets earlier this year: http://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-2016/#87796d267961
So much data is out there, but not in one place. As for the workshop participant inquiring about theater data, there does not seem to be a single database for them. For the record, though, they were only asking about theater data as a hypothetical and did not claim to actually be interested in that line of scholarship! It is my hope that more projects develop and that the spirit of openness adopted by many scientific and government communities spreads across the spectrum of disciplines and industries, so that one day there will be a database of all data.