Data Project: Female Writer Metadata in Wikidata

The unsatisfactory representation of women on Wikipedia has received much attention in recent years (such as here). Specifically, both a dearth of coverage of women and bias within the Wikipedia articles on women that do exist have been observed. This issue has been connected to the low percentage of female Wikipedia editors; according to the article above, 84-91% of Wikipedia editors are male. In particular, Wikipedia’s inadequate coverage of female writers has been highlighted. Although measures have been taken to combat this general lack of representation, it remains a problem.

This representation issue extends to Wikipedia’s linked data counterparts: DBpedia, which extracts linked data from Wikipedia articles, and Wikidata, which supplies data both to the public at large and to Wikipedia articles themselves. Interesting analyses of Wikipedia’s linked data initiatives are already being done. For example, the Wikidata Human Gender Indicators project has examined the intersection of gender with other aspects such as occupation, country, and ethnicity in Wikipedia biography articles, and it makes its findings available as an open data set. The group also updates its data continually in response to Wikidata updates. Because of my great interest in linked data, I wanted to explore gender treatment in Wikipedia’s linked data initiatives further. In particular, I wished to examine not only Wikidata’s coverage of women writers but also the depth of existing Wikidata entries on women writers, as analyzed through the lens of the metadata used in these entries.

I chose to do so using a query language and a publicly accessible tool provided by Wikidata. I learned the basics of SPARQL (SPARQL Protocol and RDF Query Language), a query language developed specifically for querying linked data in RDF form, during a professional development course for librarians that I took this summer. I was able to practice querying further (specifically SQL, using MySQL) in the two Digital Fellows workshops I attended at the beginning of this semester: Databases Part I: Introduction to Data Management with Databases and SQL and Databases Part II: Querying in the Real World. Although some linked datasets must be queried using a command line tool or another tool separate from the database, some dataset providers, such as Wikidata, provide access to a SPARQL endpoint. Wikidata’s SPARQL endpoint, the Wikidata Query Service, provides a user-friendly interface for entering queries to explore Wikidata in its entirety.
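To illustrate what querying such an endpoint involves, here is a minimal Python sketch that builds a request URL for the Wikidata Query Service. It is not one of the queries used in this project. The identifiers are assumptions supplied for the example: P106 (occupation), Q36180 (writer), P21 (sex or gender), and Q6581072 (female). The helper name `build_request_url` is hypothetical.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query only (assumed identifiers, not taken from the post):
# P106 = occupation, Q36180 = writer, P21 = sex or gender, Q6581072 = female.
SAMPLE_QUERY = """
SELECT ?person ?personLabel WHERE {
  ?person wdt:P106 wd:Q36180 ;
          wdt:P21 wd:Q6581072 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""


def build_request_url(query: str, fmt: str = "json") -> str:
    """Return a GET URL that submits `query` to the endpoint.

    The query text is URL-encoded; `fmt` asks the service for a result format.
    """
    return ENDPOINT + "?" + urlencode({"query": query.strip(), "format": fmt})
```

Calling `build_request_url(SAMPLE_QUERY)` yields a URL that can be opened in a browser or fetched with any HTTP client; the same query text can also be pasted directly into the Query Service's web interface.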

I constructed and ran a series of SPARQL queries, first to see for myself the extent of Wikidata’s current coverage of women writers and then to delve into specific properties used in the entries on women. Before beginning to explore the data formally using SPARQL, I looked through the Wikidata pages for well-known writers to get a sense of which properties were commonly used, and I examined Wikidata’s property browsers, which exposed me to the extensive number of properties used on Wikidata. I referred to these resources throughout the querying process. In particular, I searched for the following properties in articles on both women and men writers: “notable works,” “described by source,” “archives at,” “genre,” “list of works,” and “influenced by,” all of which I believe represent valuable writer data. I used multiple queries so that I could see, for each property, the percentage of articles in each gender/sex category (a single property is used for gender and sex on Wikidata) in which at least one instance of the property was present, as well as the percentage of total instances of the property in each gender/sex category. Result sets are displayed on the screen and can be exported in CSV and other formats, so once I finished each query, I used the export option to view and manipulate the data in Excel. Despite writing queries aimed at eliminating duplicates, I sometimes had to remove further duplicates from the exported data in Excel.
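The per-property comparison described above could be sketched roughly as follows. This is an assumed reconstruction, not the queries actually run for this project. The property IDs (P800, P1343, P485, P136, P1455, P737) and the item Q36180 for "writer" are my mapping of the property names listed above, and the function names are hypothetical. The query uses `wdt:` predicates, so it only sees each entry's preferred ("truthy") values.

```python
# Properties examined in the post, mapped to Wikidata IDs (assumed mapping).
PROPERTIES = {
    "P800": "notable work",
    "P1343": "described by source",
    "P485": "archives at",
    "P136": "genre",
    "P1455": "list of works",
    "P737": "influenced by",
}

WRITER = "Q36180"  # assumed occupation value for "writer"


def property_count_query(prop_id: str, gender_qid: str) -> str:
    """Build a query counting writers of one gender that use a property.

    ?entries = writers with at least one value for the property
    ?values  = total number of values for the property across those writers
    """
    return (
        "SELECT (COUNT(DISTINCT ?person) AS ?entries) "
        "(COUNT(?value) AS ?values) WHERE {\n"
        f"  ?person wdt:P106 wd:{WRITER} ;\n"
        f"          wdt:P21 wd:{gender_qid} ;\n"
        f"          wdt:{prop_id} ?value .\n"
        "}"
    )


def percentage(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`, rounded to two decimals."""
    if whole == 0:
        return 0.0
    return round(100 * part / whole, 2)
```

Dividing `?entries` by the total number of writers in the same gender/sex category gives the share of entries that use the property at least once. With placeholder numbers, `percentage(5, 200)` returns `2.5`.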

In one sense, my results were what I expected: I found far more articles on male writers (including trans men writers), 99,695, than on female writers (including trans women writers), 27,134. For each of the specific properties analyzed, I compared two measures across the gender/sex categories: the percentage of total occurrences of the property and the percentage of articles that had at least one occurrence of the property. Both measures were higher for men for four of the six properties, but they were higher for women for two (“notable works” and “archives at”), indicating that these metadata categories were slightly richer for articles on women writers. However, it is important to note that the percentages were quite low overall; there are perhaps more relevant properties, often used in Wikidata articles on writers, on which I could have focused.

The dataset, tool, and methods outlined here have some strengths: notably, the structured nature of the data and the user-friendliness of the SPARQL endpoint, both of which ease the data exploration process. However, this project is rife with limitations as well. First, SPARQL endpoints are not always completely accurate and reliable. For example, using these endpoints sometimes results in long delays if the result set is large or the query complex. I had to go through a number of queries before finally settling on a simple count query that returned the number of male writers in Wikidata. All of my initial queries resulted in the message “Query timeout limit reached” because the result set was so large. Another challenge is that Wikidata content is constantly changing. I noticed slight changes in my search results over the few days during which I ran my queries. Regarding the dataset itself, it’s difficult to know for certain whether the absence or presence of some of the properties in the articles (such as the “archives at” property) is indicative of real-world deficiencies. Additional datasets would have to be consulted in order to explore this.
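A count query of the kind mentioned above asks the endpoint for a single number instead of one row per writer, which keeps the result set small. The sketch below is an assumption about what such a query could look like, not the query actually used here. It assumes P106/Q36180 for "occupation: writer", P21 for sex or gender, and Q6581097 (male) and Q2449503 (trans man) as the gender values; the function name is hypothetical.

```python
def count_writers_query(gender_qids: list[str]) -> str:
    """Build a query that counts writers whose gender is in `gender_qids`.

    Returning one aggregated number avoids transferring one row per person,
    which is what tends to make large listing queries slow or time out.
    """
    values = " ".join(f"wd:{qid}" for qid in gender_qids)
    return (
        "SELECT (COUNT(DISTINCT ?person) AS ?count) WHERE {\n"
        f"  VALUES ?gender {{ {values} }}\n"
        "  ?person wdt:P106 wd:Q36180 ;\n"  # occupation: writer (assumed ID)
        "          wdt:P21 ?gender .\n"     # sex or gender
        "}"
    )


# Assumed gender items: Q6581097 = male, Q2449503 = trans man.
MALE_WRITERS_QUERY = count_writers_query(["Q6581097", "Q2449503"])
```

`COUNT(DISTINCT ?person)` also guards against counting the same writer twice when an entry matches more than one listed gender value. Because Wikidata changes continually, the number returned will drift from one day to the next.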

Additionally, my querying skills admittedly could use some improvement; developing better skills in this arena could perhaps allow me to carry out more complex and interesting queries. Also, this is not highly original work: as mentioned above, others have already done great work with this and similar datasets. I also stopped after only an initial exploration of the data. Much more could be explored in terms of not only additional data points but also visualizations and so forth. On a more personal note, this project has strengthened my resolve to participate in future linked data initiatives. While looking through the query results, I noticed a number of articles in need of improvement.



  1. Posted November 29, 2016 at 4:59 pm | Permalink

    Kate, this is an important and interesting project. There are obviously two major issues with Wikipedia: the preponderance of male editors and writers who develop and oversee/make determinations about content of the entries; and the very real gender mismatch of the total number of entries on male vs female subjects (regardless of who is writing and editing those entries). The Wikimedia Foundation is very aware of the problem and has engaged in outreach efforts to expand the number of entries on women and on the number of women contributors to Wikipedia. The Graduate Center’s own Michael Mandiberg has helped lead those efforts, as you can see in this piece in the New Yorker: and in this ArtNews piece:

    Your initial linked data work could be expanded and deepened. I suspect playing around with Wikidata more would help overcome the frustrating “timed out” problem. Is your intention to dig deeper into this data or to move on to something else?

  2. Posted December 5, 2016 at 2:19 am | Permalink

    Thanks for your feedback, Dr. Brier! I hadn’t seen the articles you linked to, so thank you. In terms of the final project, I will be moving on to something else. I do hope to continue to explore this issue, but at this point I want to do so by taking action rather than continuing to reinforce the issue of gender gaps in the data of which many are already aware. I haven’t been able to attend any edit-a-thons yet, but I hope to do so at the next opportunity as well as do some editing on my own and contribute to linked data-focused Wikipedia efforts. Perhaps at a later point I can redo my queries to see how things have changed. I do like Mandiberg’s statement in the New Yorker article regarding community and empowerment, though. I agree that taking action on these issues can bring about additional good in the world that perhaps cannot be captured by data.

