Data Project: Interpretive Analysis of Congressional Research Service Reports

Background

I would like to analyze publicly available reports by the Congressional Research Service (CRS). CRS is housed within the Library of Congress as an agency of the legislative branch; its purpose is to provide the United States Congress with non-partisan advice on issues that may come before the legislature. CRS is often referred to as “Congress’s think tank.” CRS publishes reports that are made available to Congressional staff on an internal website. Government staff and journalists can receive reports upon request, but the public generally cannot obtain CRS reports unless they are made available by CRS or Members of Congress. For the past two decades, however, many groups have advocated for greater public access to CRS reports. One such group is Demand Progress, which has released every CRS report available on Congress’s internal website (currently numbering 8,277) on the website www.everycrsreport.com.

From my past experience volunteering with groups such as Code for America, I’ve developed a general interest in open government and open data, and in how those issues might change the public’s relationship with the government through more open streams of information and communication. Given that interest, I follow certain organizations on social media, such as the Data Foundation, which is strongly pro-open data. I came across the Every CRS Report website by chance: the Data Foundation had posted about it on its Facebook page, and the post popped up on my newsfeed.

When I found out I could download thousands of CRS reports, I wanted to use this textual data to experiment with applying digital methods to aid the practice of interpretive policy analysis. According to Dvora Yanow, interpretive policy analysis asks, “What are the meanings of a policy?” rather than “What are the costs of a policy?” or other quantitative questions (Yanow v). The preface to Yanow’s book Conducting Interpretive Policy Analysis continues:

Yanow’s book builds from the premise that the promise and implications of a policy are not transparent and easily evident in its text. Instead they are hidden (and sometimes incompatible) conclusions that are warranted in different ways by the assumptions of policymakers and multiple constituencies. To unwrap these perspectives, the interpretive policy analyst must identify groups of stakeholders and the “policy artifacts” (consisting of symbolic language, objects, and actions) that determines how a policy, together with the policy process, is “framed” or understood. (Yanow v)

It seems to me that the digital, textual analysis of government documents from a source as respected and non-partisan as the Congressional Research Service would serve as a good entryway to analyze different policies from multiple perspectives. In describing the methodology of interpretive policy analysis, Yanow notes that “sources of local knowledge” may include government document analysis, which yields the following data: written language, descriptions of objects, historical records of events, acts, interactions (39).

Interpretive policy analysis also includes other methods of data collection such as oral sources, observation, and participation, as if the analyst “were building a Picasso-like portrait of an organizational or community or policy ‘face’ from multiple angles.” In this way, the analyst attempts to see the issue and its meanings from as many angles as possible (Yanow 38).

The beginnings of this data project will hopefully allow me to use digital methods we’ve learned in class, such as topic modeling and concordance analysis, as a first step in the distant reading of thousands of government documents. The next step would be to narrow down to specific policy documents in order to perform close readings. Eventually, those close readings would be incorporated into multi-method analyses of policies that include qualitative interviewing, ethnography, and other future work. Within the field of political science more broadly, scholars have performed topic modeling on open-ended government surveys and political speeches. I’m still getting a feel for the literature, but my impression so far from reviewing these articles is that the topic model serves as the end product of their analysis. I want to stress in my data project that my use of these tools is an assistive exercise; digital methods have the potential to greatly enhance interpretive methodologies but cannot replace them. At least, that is my belief so far.

Tool Summary

I used the following tools to retrieve my data: the Firefox extension DownThemAll, Sublime Text, and HTMLAsText.

I used Filezilla in an attempt to upload my data.

I used AntConc to perform a preliminary keyword-in-context analysis of my data.

Data Collection

The Every CRS Report website contains a handy section on how to bulk download every report made available on the site. It provides a complete listing of published reports in a CSV file that contains metadata such as the report number, the most recent publication date of the report, and paths to the most recent PDF and HTML files for each report. At first, I wasn’t sure how to download the actual text of the reports. When I searched the home page for the reports I was interested in, for example “big data”, the page loaded an embedded pop-up containing site-specific Google search results. When I clicked on a result, I got a page that had the report in HTML form, but also tables to the right containing report revision histories in PDF and HTML, the JSON metadata, and the topic areas. I only wanted the text of the report.

The full page was not what I wanted or needed, so I had to figure out a way to download every report. The website does offer a Python script with instructions on how to download the reports in bulk, so I went to the GC Digital Fellows office hours for some help with using Python to download the complete archive. Jojo kindly offered to work with me, but we ran into some problems loading certain Python libraries. I’m still a beginner with Python, so my attempts to use it to scrape the report data from the HTML files had to stop there for the moment.

My next idea was to bulk download the reports using the Firefox extension DownThemAll. I took the last column of the reports.csv file on the website, which contained links to the latest HTML fragment of each report. I used Sublime Text to convert that long list into an HTML file, using the find and replace functions to add the website path to each filename, as well as a <br> tag at the end of each one. I uploaded the HTML file to my website and downloaded 7213 HTML reports to my hard drive. I then used HTMLAsText (found through a Google search) to convert those HTML files into plain txt files. Both tools worked quite well.
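For anyone who would rather script these two steps, here is a minimal Python sketch of the same idea using only the standard library. It is not the script that Every CRS Report provides, and it is not how DownThemAll or HTMLAsText work internally. The base URL, the sample column names, and the choice to wrap each path in an anchor tag are my assumptions for illustration; the only thing taken from the steps above is that the last CSV column holds the HTML path and that each line ends in a <br> tag.

```python
import csv
import io
from html.parser import HTMLParser

# Assumed site root; adjust if the paths in reports.csv resolve elsewhere.
BASE_URL = "https://www.everycrsreport.com/"


def build_link_page(csv_text, base_url=BASE_URL):
    """Turn the last column of a reports.csv-style listing into a page of links."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    links = []
    for row in reader:
        # Skip blank rows and reports that have no HTML path.
        if row and row[-1].strip():
            url = base_url + row[-1].strip()
            links.append(f'<a href="{url}">{url}</a><br>')
    return "<html><body>\n" + "\n".join(links) + "\n</body></html>"


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse whitespace (script/style text is not filtered out)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Hypothetical two-row listing, just to show the shape of the input.
sample_csv = (
    "number,latestPubDate,latestHTML\n"
    "R00001,2016-01-01,files/example_a.html\n"
    "R00002,2016-02-01,\n"
)
print(build_link_page(sample_csv))
print(html_to_text("<h1>Big Data</h1><p>Privacy &amp; analytics</p>"))
```

The link page could then be opened in the browser and handed to a bulk downloader, and html_to_text could be run over each downloaded file in a loop.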

Preliminary Analysis

I used AntConc to analyze the concordances in the downloaded reports. I wanted to perform keyword-in-context (KWIC) analysis, which I learned more about during Micki Kaufman’s workshop on data visualization. Folgerpedia provides a good definition of keyword in context:

A type of concordance output that sorts and aligns words within a textual sample alphabetically and in conjunction with surrounding text. Instead of isolating search terms in a list of individual words, KWIC allows users to see the results of a search within a limited context, providing a fuller meaning.
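To make the definition concrete, here is a rough Python sketch of a keyword-in-context listing. It only illustrates the idea; it is not how AntConc is implemented, and the function name kwic, the width parameter, and the sample sentence are all made up for this example.

```python
import re


def kwic(text, phrase, width=40):
    """Return (left context, match, right context) for each hit of phrase."""
    # Whole-word, case-insensitive match on the search phrase.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    rows = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        rows.append((left, match.group(0), right))
    return rows


# Invented sample text, not taken from any CRS report.
sample = (
    "Agencies see big data as an opportunity. "
    "Critics call Big Data a big challenge for privacy."
)
for left, hit, right in kwic(sample, "big data", width=20):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} | {hit} | {right}")
```

Sorting the rows by the right-hand context instead of by position in the text would get closer to the alphabetical alignment the definition describes.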

I wanted to analyze the discourse around a certain policy issue, such as big data, and see how discussions have potentially morphed over time: how the meanings of big data were communicated to policy-makers by these non-partisan analysts at CRS and the implications of these meanings for public administration and science and technology policy.

I loaded the corpus into AntConc. I will do more research into how to get the most out of AntConc and into the inner workings of concordance analysis, but here is a preliminary screenshot:

[Screenshot: cb antconc]

Some interesting things to point out and question already are:

  • The first time that CRS used the term “big data” was in 2012
  • Technical keywords: analytics, visualization
  • The word privacy came up – in what context? Civil rights?
  • Big data is described as both a “big challenge” and as an “opportunity”

Where to Find the Data

I uploaded the html and txt files of the reports to http://clairebalani.com/dataproject/. I also uploaded the original reports.csv obtained from everycrsreport.com, as well as my modifications.

Challenges and Next Steps

One challenge I still need to resolve is uploading the complete dataset to my server. I used Filezilla to initiate an FTP transfer, but I ran into some problems with certain uploads.

Another challenge is ensuring the integrity and completeness of the dataset. HTMLAsText seemed to convert the exact number of HTML files to txt files without duplicates, but I need to make sure. I will also have to take into account scraping data from PDF files on the Every CRS Report website that haven’t yet been made available as HTML files. Additionally, if the website releases more reports, I should probably establish a cut-off date for my data analysis.
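One way I could check the conversion is sketched below in Python. It assumes that each txt file keeps the base name of the HTML file it came from (I have not confirmed that HTMLAsText always does this), and the function name check_conversion and the folder layout are hypothetical.

```python
import hashlib
from pathlib import Path


def check_conversion(html_dir, txt_dir):
    """Compare a folder of .html files with a folder of converted .txt files."""
    html_stems = {p.stem for p in Path(html_dir).glob("*.html")}
    txt_paths = sorted(Path(txt_dir).glob("*.txt"))
    txt_stems = {p.stem for p in txt_paths}

    # Flag txt files whose contents are byte-for-byte identical.
    seen = {}
    duplicates = []
    for path in txt_paths:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append((seen[digest].name, path.name))
        else:
            seen[digest] = path

    return {
        "missing_txt": sorted(html_stems - txt_stems),  # HTML with no txt
        "extra_txt": sorted(txt_stems - html_stems),    # txt with no HTML
        "duplicate_txt": duplicates,
    }
```

Calling check_conversion on the two folders returns three lists; if all of them are empty, every HTML file has exactly one txt counterpart and no two txt files are identical. Identical contents are not necessarily an error, since two reports could legitimately convert to the same text, so any flagged pairs would still need a manual look.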

There is also the question of cleaning the text within each report. Jojo helpfully pointed out that I may want to reconsider including the footnotes within each report, and to think critically about which sections of the report I want to include in an analysis (the summary? or just the conclusion?). I’ll probably refer back to Yanow and other interpretive policy analysts to see how they’ve treated document analysis in their own research in order to make a final decision.

Finally, I want to visualize the textual analysis I perform using AntConc. At this early stage, I’m thinking of using Gephi to perform social network analysis, visualizing as nodes the different institutions, people, places, and things that were mentioned in these reports.



  1. Posted November 16, 2016 at 3:26 pm | Permalink

    What a great corpus to work with, Claire. Looking forward to hearing more as you move forward. You may want to try going to the Python User’s Group meeting (PUG) on Wednesdays from 12-2 for help with the Python libraries you’re interested in using. Also, there is a “text analysis” group starting up that you may also find to be a useful resource moving forward.

  2. Posted November 19, 2016 at 5:36 pm | Permalink

    Claire, I’m impressed by the sheer size of the CRS corpus and wonder if it might make more sense to pull out a subset to work with at this stage rather than try to deal with the unwieldy size of the .txt files you will have to deal with. Perhaps you could pick one or two especially rich years for reports, or compare ones in different Congresses, say one from 2009 and one from 2014 (not exactly clear from your blog post or the Every CRS Report website what years are included). Your commitment to open government data is admirable and I think letting the public know what can be found in such a rich corpus of digital material will be extremely beneficial. It will be interesting also to see if the CRS remains as open under the “New Regime.”


