I would like to analyze publicly available reports by the Congressional Research Service (CRS). CRS is housed within the Library of Congress as an agency of the legislative branch; its purpose is to provide the United States Congress with non-partisan advice on issues that may come before the legislature. CRS is often referred to as “Congress’s think tank.” CRS publishes reports that are made available to Congressional staff on an internal website. Government staff and journalists can receive reports upon request, but the public generally cannot obtain CRS reports unless they are made available by CRS or Members of Congress. Over the past two decades, however, many groups have advocated for greater public access to CRS reports. One such group is Demand Progress, which has released every CRS report available on Congress’s internal website (currently numbering 8,277) at www.everycrsreport.com.
From my past experience volunteering with groups such as Code for America, I’ve developed a general interest in open government and open data, and in how those issues might change the public’s relationship with the government through more open streams of information and communication. Given that interest, I follow organizations on social media such as the Data Foundation, which is strongly pro-open data. I came across the Every CRS Report website by chance: the Data Foundation had posted about it on its Facebook page, and the post popped up on my newsfeed.
When I found out I could download thousands of CRS reports, I wanted to use this textual data to experiment with applying digital methods to aid the practice of interpretive policy analysis. According to Dvora Yanow, interpretive policy analysis asks, “What are the meanings of a policy?” rather than “What are the costs of a policy?” and other quantitative questions (Yanow v). The preface to Yanow’s book Conducting Interpretive Policy Analysis continues:
Yanow’s book builds from the premise that the promise and implications of a policy are not transparent and easily evident in its text. Instead they are hidden (and sometimes incompatible) conclusions that are warranted in different ways by the assumptions of policymakers and multiple constituencies. To unwrap these perspectives, the interpretive policy analyst must identify groups of stakeholders and the “policy artifacts” (consisting of symbolic language, objects, and actions) that determines how a policy, together with the policy process, is “framed” or understood. (Yanow v)
It seems to me that the digital, textual analysis of government documents from a source as respected and non-partisan as the Congressional Research Service would serve as a good entry point for analyzing different policies from multiple perspectives. In describing the methodology of interpretive policy analysis, Yanow notes that “sources of local knowledge” may include government document analysis, which yields the following data: written language, descriptions of objects, historical records of events, acts, interactions (39).
Interpretive policy analysis also includes other methods of data collection such as oral sources, observation, and participation, as if the analyst “were building a Picasso-like portrait of an organizational or community or policy ‘face’ from multiple angles.” In this way, the analyst attempts to see the issue and its meanings from as many angles as possible (Yanow 38).
The beginnings of this data project will hopefully allow me to use the digital methods we’ve learned in class, such as topic modeling and concordance analysis, as a first step in the distant reading of thousands of government documents. The next step would be to narrow down specific policy documents in order to perform close readings. Eventually, those close readings would be incorporated into multi-method analyses of policies that include qualitative interviewing, ethnography, and other future work. Within the field of political science more broadly, scholars have performed topic modeling on open-ended government surveys and political speeches. I’m still getting a feel for the literature, but my impression so far from reviewing these articles is that the topic model serves as the product of their analysis. I want to stress in my data project that my use of these tools is an assistive exercise; digital methods have the potential to greatly enhance interpretive methodologies but cannot replace them. At least, that is my belief so far.
The Every CRS Report website contains a handy section on how to bulk download every single report made available on the site. It provides a complete listing of published reports in a CSV file that contains metadata such as the report number, the most recent publication date of the report, and paths to the most recent PDF and HTML files for each report. At first, I wasn’t sure how to download the actual text of the reports. When I searched the home page for the reports I was interested in, for example “big data”, the page loaded an embedded pop-up containing site-specific Google search results. When I clicked on a result, I got a page that had the report in HTML form, but also tables to the right containing report revision histories in PDF and HTML, the JSON metadata, and the topic areas. I only wanted the text of the report.
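A minimal Python sketch of reading that metadata listing with the standard library. The column names `number`, `latestPubDate`, `latestPDF`, and `latestHTML` are assumptions about the header row of reports.csv and should be checked against the actual file; the sample rows are made up.

```python
import csv
import io


def reports_with_html(csv_file):
    """Return the metadata rows that list a path to an HTML version.

    The "latestHTML" column name is an assumption about reports.csv.
    """
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row.get("latestHTML")]


# Tiny made-up sample standing in for the real reports.csv.
sample = io.StringIO(
    "number,latestPubDate,latestPDF,latestHTML\n"
    "R00001,2016-01-05,files/a.pdf,files/a.html\n"
    "R00002,2015-03-10,files/b.pdf,\n"
)
rows = reports_with_html(sample)
for row in rows:
    print(row["number"], row["latestPubDate"], row["latestHTML"])
```

With the real file, the same function could be called on `open("reports.csv", newline="", encoding="utf-8")` instead of the in-memory sample.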
The website does offer a Python script with instructions on how to download the reports in bulk, so I went to the GC Digital Fellows office hours for some help with potentially using Python to download the complete archive. Jojo kindly offered to help me, but we ran into problems loading certain Python libraries. I’m still a beginner with Python, so my attempts to use it to scrape the report data from HTML files had to be paused for the moment.
My next idea was to bulk download the reports using the Firefox extension DownThemAll. I took the last column of the reports.csv file on the website, which contained links to the latest HTML fragments of each report. I used Sublime Text to convert that long list into an HTML file, using the find and replace functions to add the website path, as well as <br> tags at the end of each filename. I uploaded the HTML file to my website and downloaded 7,213 HTML reports to my hard drive. I then used HTMLAsText (found through a Google search) to convert those HTML files into plain txt files. Both tools worked quite well.
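For comparison, the same two steps (turning the last CSV column into full links, then stripping HTML down to text) can be sketched in Python with only the standard library. This is not the workflow I actually used. `BASE_URL` is an assumed site root, the sample rows are made up, and the text extraction is deliberately crude: it joins text nodes with spaces, so spacing around punctuation may differ from HTMLAsText’s output.

```python
import csv
import io
from html.parser import HTMLParser

BASE_URL = "https://www.everycrsreport.com/"  # assumed site root


def urls_from_last_column(csv_file, base_url=BASE_URL):
    """Build full URLs from the last column of each data row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    return [base_url + row[-1] for row in reader if row and row[-1]]


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML fragment as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up sample rows; the real file has more columns.
sample_csv = io.StringIO("number,latestHTML\nR1,files/a.html\nR2,\n")
print(urls_from_last_column(sample_csv))
print(html_to_text("<p>Big <b>data</b></p><script>x=1</script>"))
```

Each URL could then be fetched with `urllib.request.urlopen` and the response passed through `html_to_text` before saving it as a txt file.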
I used AntConc to analyze the concordances in the downloaded reports. I wanted to perform keyword-in-context (KWIC) analysis, which I learned more about during Micki Kaufman’s workshop on data visualization. Folgerpedia provides a good definition of keyword in context:
A type of concordance output that sorts and aligns words within a textual sample alphabetically and in conjunction with surrounding text. Instead of isolating search terms in a list of individual words, KWIC allows users to see the results of a search within a limited context, providing a fuller meaning.
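To make the idea concrete, here is a minimal KWIC sketch in Python. It is an illustration of the concept, not how AntConc works internally; the function name `kwic`, the `width` parameter, and the sample sentence are all made up for the example.

```python
import re


def kwic(text, phrase, width=30):
    """Return (left, match, right) context tuples for each hit.

    Matching is case-insensitive and limited to whole words.
    """
    # Collapse whitespace so context windows are not broken by newlines.
    flat = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits


# Made-up sentence for illustration only.
sample = (
    "Agencies see big data as an opportunity. "
    "Yet Big Data also raises privacy questions."
)
for left, match, right in kwic(sample, "big data", width=20):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} | {match} | {right}")
```

Aligning the keyword in a single column is what lets the eye scan the words that recur on either side of it.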
I wanted to analyze the discourse around a certain policy issue, such as big data, and see how discussions have potentially morphed over time: how the meanings of big data were communicated to policy-makers by these non-partisan analysts at CRS, and the implications of these meanings for public administration and science and technology policy.
I loaded the corpus into AntConc. I will do more research into how to get the most out of AntConc, as well as into the inner workings of concordance analysis, but here is a preliminary screenshot:
- The first time that CRS used the term “big data” was in 2012
- Technical keywords: analytics, visualization
- The word privacy came up – in what context? Civil rights?
- Big data is described as both a “big challenge” and as an “opportunity”
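An observation like the first one could be checked by counting, per publication year, how many reports mention a phrase. The sketch below assumes each report’s text has already been paired with a date from the metadata (that pairing is hypothetical here, and the records are made up). Because reports.csv lists only the most recent publication date of each report, a “first year” found this way reflects latest versions rather than original releases.

```python
from collections import Counter


def mentions_by_year(reports, phrase):
    """Count reports per publication year whose text contains the phrase.

    `reports` is an iterable of (iso_date, text) pairs; pairing dates
    with texts is assumed to have been done beforehand.
    """
    needle = phrase.lower()
    counts = Counter()
    for iso_date, text in reports:
        if needle in text.lower():
            counts[iso_date[:4]] += 1
    return counts


# Made-up records for illustration only.
reports = [
    ("2012-06-01", "Big data is a big challenge."),
    ("2014-02-11", "Agencies view big data as an opportunity."),
    ("2014-09-30", "This report covers farm policy."),
]
counts = mentions_by_year(reports, "big data")
first_year = min(counts) if counts else None
print(dict(counts), first_year)
```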
I uploaded the HTML and txt files of the reports to http://clairebalani.com/dataproject/. I also uploaded the original reports.csv obtained from everycrsreport.com, as well as my modified versions.
One challenge that I need to address is completing the upload of the actual data to my server. I used FileZilla to initiate an FTP transfer, but I ran into some problems with certain uploads.
Another challenge is ensuring the integrity and completeness of the dataset. HTMLAsText seemed to convert exactly as many HTML files to txt files as I gave it, without duplicates, but I need to make sure. I will also have to take into account scraping data from PDF files on the Every CRS Report website that haven’t yet been made available as HTML files. Additionally, if the website releases more reports, I should probably establish a cut-off date for my data analysis.
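One way to make sure would be to compare file names between the two folders. The sketch below assumes each converted file keeps the base name of its HTML source (for example, a.html becomes a.txt), which I have not yet verified for HTMLAsText; the folder arguments are placeholders.

```python
from pathlib import Path


def compare_conversions(html_dir, txt_dir):
    """Compare file stems between the HTML folder and the txt folder.

    Assumes a converted file shares its base name with its source.
    """
    html_stems = {p.stem for p in Path(html_dir).glob("*.html")}
    txt_paths = list(Path(txt_dir).glob("*.txt"))
    txt_stems = {p.stem for p in txt_paths}
    return {
        # HTML files with no matching txt file.
        "missing_txt": sorted(html_stems - txt_stems),
        # txt files with no matching HTML file.
        "extra_txt": sorted(txt_stems - html_stems),
        # txt files that exist but contain nothing.
        "empty_txt": sorted(p.name for p in txt_paths if p.stat().st_size == 0),
    }
```

Calling `compare_conversions("html", "txt")` on the two download folders should return three empty lists if every report was converted exactly once and no output file is empty.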
There is also the question of cleaning the text within each report. Jojo helpfully pointed out that I may want to reconsider including footnotes within each report, and to think critically about which sections of the report I want to include in an analysis (the summary? Or just the conclusion?). I’ll probably refer back to Yanow and other interpretive policy analysts to see how they’ve treated document analysis in their own research in order to make a final decision.
Finally, I want to visualize the textual analysis I perform using AntConc. At this early stage, I’m thinking of using Gephi to perform social network analysis, visualizing as nodes the different institutions, people, places, and things that are mentioned in these reports.
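As a rough sketch of how the text could feed into Gephi, the Python below counts how often pairs of named entities appear in the same report and writes the result as an edge list. The entity list and sample texts are made up, matching by simple substring is a stand-in for real entity recognition, and the `Source`, `Target`, `Weight` header follows my understanding of what Gephi’s spreadsheet importer expects, which should be checked against Gephi’s documentation.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def cooccurrence_edges(texts, entities):
    """Count how many texts mention each pair of entities together."""
    edges = Counter()
    for text in texts:
        lowered = text.lower()
        # Sort so each pair is always stored in the same order.
        present = sorted(e for e in entities if e.lower() in lowered)
        edges.update(combinations(present, 2))
    return edges


def write_edge_list(edges, out_file):
    """Write edges as a Source,Target,Weight CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])


# Made-up texts and entities for illustration only.
texts = [
    "The White House and the Federal Trade Commission discussed privacy.",
    "The Federal Trade Commission published guidance on privacy.",
]
entities = ["White House", "Federal Trade Commission", "privacy"]
edges = cooccurrence_edges(texts, entities)
buffer = io.StringIO()
write_edge_list(edges, buffer)
print(buffer.getvalue())
```

Writing to a real file instead of the in-memory buffer (opened with `newline=""`) would give a CSV that could be imported as an undirected graph, with the weight showing how many reports mention both entities.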