My data set (housed here) comes from the IRS’s Statistics of Income (SOI) program, which no doubt sounds like the dullest source ever for a DH project. Bear with me! Due to the nature of its record-keeping, the IRS is home to a large collection of data on U.S. migration, based on address information reported annually via individual tax returns. This data is available at both the state and county levels dating back to 1991, and is thus a more useful indicator of the movement of people than the Census’s American Community Survey (ACS) data, which only captures county-to-county migration flow beginning in 2005. Naturally, there are a number of limitations to the use of IRS data in this way, which I address at the end of this post.
The data set I have chosen contains county-to-county migration data for Virginia from 1992 to 1993. I selected this data set because I am interested in demography, particularly with respect to understanding changes in voting patterns, and Virginia makes an interesting case study as a state that “flipped” from red to blue over the past twenty years (although 538* has it down as a swingy “tipping point” state this election cycle). The Pew Research Center has done survey work indicating that American values have diverged in the past twenty years only along partisan lines, suggesting that something other than ideological polarization is at work in the American electorate. Are Americans merely becoming more adept at sorting themselves into the party that best aligns with their values? Or is geographic sorting taking place as well? To test the latter, I want to quantify the extent to which changes in county-level returns in presidential elections (which are available here) can be explained by the level of migration experienced by that county in the four years preceding those elections. Within the context of that larger research project, then, the IRS’s SOI data on migration will serve as my independent variable, starting with 1992 to 1993; since data is only available from 1991, the first four-year period I will be looking at is 1992 to 1996.
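To make the "explained by" part concrete, here is a minimal sketch of what such a test could look like in R. The post does not specify a model, so a simple linear regression is an assumption here, and every column name and number below is invented purely for illustration.

```r
# Toy county-level data: all values are made up, not real election or IRS figures.
# net_migration_rate: hypothetical net migration over four years as a share of population
# dem_share_change:   hypothetical change in one party's vote share, in percentage points
counties <- data.frame(
  fips               = c("51001", "51003", "51005", "51007", "51009", "51011"),
  net_migration_rate = c(0.02, -0.01, 0.05, 0.00, 0.03, -0.02),
  dem_share_change   = c(1.5, -0.4, 3.1, 0.2, 1.9, -1.0)
)

# Regress the change in vote share on migration (the independent variable).
fit <- lm(dem_share_change ~ net_migration_rate, data = counties)

# R-squared is one rough measure of how much of the variation migration accounts for.
r_squared <- summary(fit)$r.squared
print(coef(fit))
print(r_squared)
```

A real analysis would presumably need controls and more than one election cycle; this only shows the shape of the question.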
In the meantime, I would like to create a map to visualize annual migration levels because a) data viz, and b) maps! Part of the reason I chose Virginia for this project is that it contains areas of both low and high migration levels, and I am interested in seeing that information represented visually (which will no doubt raise new questions, as maps do). I’m envisioning a gif of a series of heat maps, one for each year from 1992 to 2012, with counties lit up depending on the degree to which they lost or gained residents in a given year. Maybe I can even make something interactive!
But before I can get to that, I need to gather and process migration data for each of those years. Working with the 1992 to 1993 data set, I wrote a program in R that calculates the migration levels for each county, which I can re-use for each of the years I need. You can find the details at my GitHub, but here’s a broad breakdown: I read the data into R, and then created and altered a series of data frames to pare down the information to only what I needed. You may have noticed that my “data set” is actually two data sets, as the in-migration and out-migration data are housed in separate .xls files, so I had to process and clean each separately before merging them into one data frame. Further, true to its title (“county-to-county”), the SOI data contains not only information about the inflow and outflow of people for each county, but also details about where those people moved to or from, which is information I do not need for this project and therefore needed to remove. I also renamed columns to reflect the information for which the IRS data acts as proxy: number of returns for number of households and number of exemptions claimed for number of individuals. (I am concerned with the latter.) I also separated out the Federal Information Processing Standard (FIPS) county numbers, as I will need to have these handy when I ultimately merge these numbers with election data for regression and shape files for mapping.
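For a rough sense of what those steps look like in R, here is a minimal sketch on toy data. It is not the actual program from the repository: the column names and numbers are hypothetical stand-ins for the real .xls layout, and it drops the origin/destination detail by summing flows per county, which is only one way to do it (the real files may include summary rows that make this unnecessary). The one real-world detail is that 51 is Virginia's state FIPS code.

```r
# Toy stand-ins for the two IRS files. Every column name and number here is
# hypothetical, not taken from the real .xls layout.
inflow <- data.frame(
  dest_state    = c(51, 51, 51),
  dest_county   = c(1, 1, 3),
  origin_state  = c(51, 24, 51),
  origin_county = c(3, 31, 1),
  returns       = c(40, 25, 30),
  exemptions    = c(90, 60, 70)
)
outflow <- data.frame(
  origin_state  = c(51, 51, 51),
  origin_county = c(1, 3, 3),
  dest_state    = c(51, 51, 37),
  dest_county   = c(3, 1, 183),
  returns       = c(30, 40, 10),
  exemptions    = c(70, 90, 20)
)

# Build a five-digit FIPS code: two-digit state plus three-digit county.
make_fips <- function(state, county) {
  sprintf("%02d%03d", as.integer(state), as.integer(county))
}

# Key each file by the county of interest: destination for inflow, origin for outflow.
inflow$fips  <- make_fips(inflow$dest_state, inflow$dest_county)
outflow$fips <- make_fips(outflow$origin_state, outflow$origin_county)

# Drop the "where to / where from" detail by summing flows per county.
in_totals  <- aggregate(cbind(returns, exemptions) ~ fips, data = inflow,  FUN = sum)
out_totals <- aggregate(cbind(returns, exemptions) ~ fips, data = outflow, FUN = sum)

# Rename columns for what the IRS counts act as proxies for.
names(in_totals)  <- c("fips", "households_in",  "individuals_in")
names(out_totals) <- c("fips", "households_out", "individuals_out")

# Merge into one data frame; all = TRUE keeps counties that appear in only one file.
migration <- merge(in_totals, out_totals, by = "fips", all = TRUE)
migration[is.na(migration)] <- 0

# Net change in individuals (exemptions) per county.
migration$net_individuals <- migration$individuals_in - migration$individuals_out
print(migration)
```

Keeping the FIPS code as a character string preserves leading zeros, which matters when joining against election returns or shape files keyed the same way.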
As I mentioned before, there are several limitations to the IRS SOI data I am using. While it has a few advantages over ACS data (mainly, a head start of roughly fifteen years and the fact that its numbers come from actual migration rather than survey data), its most severe limitation is also its most obvious: it only contains information for people who file tax returns. The movement of undocumented people, unhoused people, full-time students, and anyone else not filing taxes will not be reflected in this data. For my purposes, this limitation is more of a problem with respect to the mapping portion of my project, since I will be unable to represent all actual migration that takes place; it poses less of a threat to my larger research project, as I suspect there is a large overlap between those who don’t vote and those who are left out of the IRS data. In any case, this is a limitation that I will need to keep in mind as I move forward. A final and smaller issue with this data is that despite being “historical” by ACS standards, 1991 was not so very long ago, and not having county-to-county data prior to that year limits the possible scope of the project to 25 years. More historical data would allow for a stronger case.
For anyone interested in the nuts and bolts of my R program, here is a(nother) link to my GitHub: https://github.com/cpfisher11/VA-Migration. Come visit and tell me all the ways I could have avoided creating two dozen similarly named data frames!