Data Project: Virginia is for Movers

My data set (housed here) comes from the IRS’s Statistics of Income (SOI) program, which no doubt sounds like the dullest source ever for a DH project. Bear with me! Due to the nature of its record-keeping, the IRS is home to a large collection of data on U.S. migration, based on address information reported annually via individual tax returns. This data is available at both the state and county levels dating back to 1991, and is thus a more useful indicator of the movement of people than the Census’s American Community Survey (ACS) data, which only captures county-to-county migration flow beginning in 2005. Naturally, there are a number of limitations to the use of IRS data in this way, which I address at the end of this post.

The data set I have chosen contains county-to-county migration data for Virginia from 1992 to 1993. I selected it because I am interested in demography, particularly with respect to understanding changes in voting patterns, and Virginia makes an interesting case study as a state that “flipped” from red to blue over the past twenty years (although 538* has it down as a swingy “tipping point” state this election cycle). The Pew Research Center has done survey work indicating that American values have diverged in the past twenty years only along partisan lines, suggesting that something other than ideological polarization is at work in the American electorate. Are Americans merely becoming more adept at sorting themselves into the party that best aligns with their values? Or is geographic sorting taking place as well? To test the latter, I want to quantify the extent to which changes in county-level returns in presidential elections (which are available here) can be explained by the level of migration that county experienced in the four years preceding those elections. Within the context of that larger research project, then, the IRS’s SOI data on migration will serve as my independent variable, starting with 1992 to 1993; since data is only available from 1991, the first four-year period I will be looking at is 1992 to 1996.
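As a rough illustration of the regression idea described above, here is a minimal sketch in Python (not the R program linked in this post). It fits a one-predictor least-squares line and reports R², the share of variation in the outcome that the predictor accounts for. Everything specific here is an assumption: the choice of net migration as a percentage of population for the predictor, the choice of change in one party's vote share for the outcome, and the five data points, which are made up purely to show the mechanics.

```python
def simple_ols(x, y):
    """Return (slope, intercept, r_squared) for a one-predictor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared


# Made-up values for five hypothetical counties (not real data).
# Predictor: net migration over four years, as a percent of population.
net_migration_pct = [-2.0, 0.0, 1.0, 3.0, 5.0]
# Outcome: change in one party's vote share, in percentage points.
vote_share_change = [-1.0, 0.5, 1.0, 2.0, 4.5]

slope, intercept, r_squared = simple_ols(net_migration_pct, vote_share_change)
print(f"slope={slope:.3f} intercept={intercept:.3f} r_squared={r_squared:.3f}")
```

A real analysis would likely need more than one predictor and some care about county size, so this is only the simplest possible version of "how much of the change can migration explain."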

In the meantime, I would like to create a map to visualize annual migration levels because a) data viz, and b) maps! Part of the reason I chose Virginia for this project is that it contains areas of both low and high migration levels, and I am interested in seeing that information represented visually (which will no doubt raise new questions, as maps do). I’m envisioning a GIF of a series of heat maps, one for each year from 1992 to 2012, with counties lit up depending on the degree to which they lost or gained residents in a given year. Maybe I can even make something interactive!
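One practical detail for an animated series of heat maps is that every frame should use the same class breaks, so that a given shade means the same thing in 1992 as in 2012. The sketch below, in Python, shows one way to assign each county to a class from a fixed set of breaks. The breaks, the labels, the unit (net migrants per 1,000 residents), and the sample values are all hypothetical placeholders, not figures from the IRS data.

```python
import bisect

# Hypothetical class breaks, in net migrants per 1,000 residents.
# The same breaks are reused for every year so frames stay comparable.
BREAKS = [-20, -5, 5, 20]
LABELS = [
    "large loss",
    "moderate loss",
    "roughly stable",
    "moderate gain",
    "large gain",
]


def classify(net_rate):
    """Return the label of the class that a net migration rate falls into."""
    return LABELS[bisect.bisect_right(BREAKS, net_rate)]


def classify_year(net_rate_by_fips):
    """Map each county FIPS code to its class label for one year (one frame)."""
    return {fips: classify(rate) for fips, rate in net_rate_by_fips.items()}


# Made-up rates for two counties in two years, for illustration only.
rates_by_year = {
    1992: {"51013": 12.0, "51059": -3.0},
    1993: {"51013": 25.0, "51059": -8.0},
}
frames = {year: classify_year(rates) for year, rates in rates_by_year.items()}
for year, frame in sorted(frames.items()):
    print(year, frame)
```

Each entry in `frames` could then be joined to county shapes by FIPS code and drawn as one frame of the animation.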

But before I can get to that, I need to gather and process migration data for each of those years. Working with the 1992 to 1993 data set, I wrote a program in R that calculates the migration levels for each county, which I can re-use for each of the years I need. You can find the details at my GitHub, but here’s a broad breakdown: I read the data into R, and then created and altered a series of data frames to pare down the information to only what I needed. You may have noticed that my “data set” is actually two data sets, as the in-migration and out-migration data are housed in separate .xls files, so I had to process and clean each separately before merging them into one data frame. Further, true to its title (“county-to-county”), the SOI data contains not only information about the inflow and outflow of people for each county, but also details about where those people moved to or from, which I do not need for this project and therefore removed. I also renamed columns to reflect the information for which the IRS data acts as a proxy: number of returns for number of households, and number of exemptions claimed for number of individuals. (I am concerned with the latter.) Finally, I separated out the Federal Information Processing Standard (FIPS) county numbers, as I will need to have these handy when I ultimately merge these numbers with election data for regression and shape files for mapping.
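The steps described above can be sketched roughly as follows. This is an illustration in Python, not the R program on GitHub, and it makes several assumptions: the field names are hypothetical rather than the IRS’s actual column headers, the handful of rows are made up, and any summary rows in the real files are assumed to have been filtered out already. The sketch totals exemptions (the proxy for individuals) per county in each file, drops the other end of each flow, and merges the two totals by FIPS code.

```python
from collections import defaultdict

# Made-up rows for illustration only; field names are hypothetical.
# Inflow rows: people arriving in dest_fips from origin_fips.
inflow_rows = [
    {"dest_fips": "51013", "origin_fips": "11001", "returns": 120, "exemptions": 250},
    {"dest_fips": "51013", "origin_fips": "24031", "returns": 80, "exemptions": 170},
    {"dest_fips": "51059", "origin_fips": "51013", "returns": 60, "exemptions": 140},
]
# Outflow rows: people leaving origin_fips for dest_fips.
outflow_rows = [
    {"origin_fips": "51013", "dest_fips": "51059", "returns": 60, "exemptions": 140},
    {"origin_fips": "51059", "dest_fips": "11001", "returns": 90, "exemptions": 200},
]


def total_by_county(rows, county_key):
    """Sum exemptions per county, ignoring where the flow came from or went to."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[county_key]] += row["exemptions"]
    return dict(totals)


def net_migration(inflow, outflow):
    """Merge inflow and outflow totals into one record per county FIPS code."""
    moved_in = total_by_county(inflow, "dest_fips")
    moved_out = total_by_county(outflow, "origin_fips")
    merged = {}
    for fips in sorted(set(moved_in) | set(moved_out)):
        individuals_in = moved_in.get(fips, 0)
        individuals_out = moved_out.get(fips, 0)
        merged[fips] = {
            "individuals_in": individuals_in,
            "individuals_out": individuals_out,
            "net": individuals_in - individuals_out,
        }
    return merged


county_migration = net_migration(inflow_rows, outflow_rows)
for fips, row in county_migration.items():
    print(fips, row)
```

Keying everything on the FIPS code from the start is what makes the later merges with election returns and shape files straightforward.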

As I mentioned before, there are several limitations to the IRS SOI data I am using. While it has a few advantages over ACS data (mainly, a head start of roughly fourteen years and the fact that its numbers come from address changes reported on actual tax returns rather than from survey responses), its most severe limitation is also its most obvious: it only contains information for people who file tax returns. The movement of undocumented people, unhoused people, full-time students, and anyone else not filing taxes will not be reflected in this data. For my purposes, this limitation is more of a problem for the mapping portion of my project, since I will be unable to represent all of the migration that actually takes place; it poses less of a threat to my larger research project, as I suspect there is a large overlap between those who don’t vote and those who are left out of the IRS data. In any case, this is a limitation that I will need to keep in mind as I move forward. A final and smaller issue with this data is that despite being “historical” by ACS standards, 1991 was not so very long ago, and not having county-to-county data prior to that year limits the possible scope of the project to 25 years. A longer historical record would allow for a stronger case.

For anyone interested in the nuts and bolts of my R program, here is a(nother) link to my GitHub: Come visit and tell me all the ways I could have avoided creating two dozen similarly named data frames!

*I am counting down the seconds until I can stop compulsively refreshing that map. Click through at your own risk!



  1. Posted November 7, 2016 at 10:14 pm

    Quick GitHub suggestion: don’t start each paragraph with the pound/hash sign (#). GitHub, and markdown, use # as the equivalent of <h1> in HTML. To render text as a paragraph, just leave a blank line after it.

    Here’s a good guide:

    • Posted November 7, 2016 at 11:14 pm

      Oh my goodness, thank you! That is very helpful. I see a hash sign and automatically think “comment!” Can you tell I’ve been spending all my time in R? 🙂

      • Posted November 8, 2016 at 3:51 am

        No problem! I love markdown as an easy-to-use markup language, but the hash sign can throw off users at first.

  2. Posted November 12, 2016 at 7:12 pm

    Not dull at all. Your project is super timely! I almost wrote “relevant,” but after the discussion in class I hesitate to use that word, since it’s subjective. Well, relevant to me, and to anyone who cares about electoral politics in the United States.

  3. Posted November 16, 2016 at 6:19 pm

    Carolyn, I’ll echo Jenna’s sentiments about timeliness! It would be useful to hear how you made decisions, and what served as your guide, while manipulating your data in R. Did you use an existing package that others who do similar work use? How might you check your work? There’s a lot to work with here, and certainly a lot to think about in terms of the relationship between migration and causation/correlation to election results. Really impressive work, though, branching out to new technologies and experimenting in public. Looking forward to next steps!

  4. Posted November 19, 2016 at 3:00 pm

    Carolyn: Now that the election’s over and Clinton “won” Virginia, your analysis is even more timely in that you were correct to want to analyze the evolution of Virginia from purple to blue in an election that swung the other way. I’d like to hear more about how you plan to analyze in- and out-migration and which data points from the IRS dataset on income you will be using. Are you analyzing every county in the state (which has many) or will you focus on several, especially the ones in northern Virginia which have experienced the greatest inflow and which have changed the most politically? I was also puzzled by your use of the term “American values” in your post and wasn’t entirely clear on what you meant. Is that Pew’s term or your own? If the former, how do they define those values?

  5. Posted July 1, 2017 at 2:16 pm

    Thank you for your post.
