Analyzing Voter Turnout Data with R

Part 2: Joining Tabular Data to Shapefiles

Data for Democracy, Fall 2024
Andy Lyons

https://ajlyons.github.io/dfd2024/

Outline

better ways to import tabular data
importing Shapefiles and other GIS data on disk
importing spatial data from US Census with tidycensus
joining tables
reshaping data from wide to long
Exercise 2. Import and Map Voter Turnout Data

Voter Turnout Data

Counties and states are responsible for collecting and reporting voting precinct boundaries and turnout

Availability, quality, and format vary widely from state to state

Organizations & initiatives that collect and publish election data:

state election boards
Redistricting Data Hub - project under New Venture Fund (NVF), implemented by HaystaqDNA
MGGG Redistricting Lab - Tufts University, Cornell

Redistricting Data Hub

https://redistrictingdatahub.org/

RDH Mission

The nonpartisan Redistricting Data Hub provides individuals, good government organizations, and community groups the data, resources, and knowledge to participate effectively in the redistricting process.

Resources

RDH collects, cleans, and aggregates census and election data including:

Precinct Boundaries and Election Results
Voter Files
Incumbent Addresses
Population Projections
PL 94-171
Legislative Boundaries
American Community Survey (ACS)
Citizen Voting Age Population (CVAP)
TIGER Boundary Files
Community of Interest (COI) maps
Public redistricting testimony
Official adjusted state redistricting datasets
Official adopted legislative boundaries

Download data with a free account
Most layers are available as a Shapefile and CSV
Training, Case Studies, Articles, FAQs

Better Ways to Import Tabular Data

Dedicated import packages, such as readxl in the example below, allow you to:

import tabular data in different formats
skip rows that don’t actually contain data
specify whether the file contains a header
rename columns as part of the importing
specify the data type for each column

Example:

library(readxl)
my_tbl <- read_xlsx(path = "plot_data.xlsx", 
                    sheet = "Sheet2",
                    skip = 3,
                    col_names = c("plot_num", "date", "species", "count"),
                    # readxl has no "integer" type; use "numeric" for counts
                    col_types = c("text", "date", "text", "numeric"))
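
The same import options are available for CSV files. The following is a minimal sketch using readr; the file contents and column names are made up for illustration, and the file is written to a temporary location so the example is self-contained.

```r
library(readr)

# Write a small CSV with two junk lines above the header (made-up data)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("Exported by field app",
             "Survey season 2024",
             "Plot,Date,Species,Count",
             "P1,2024-05-01,oak,12",
             "P2,2024-05-02,pine,7"),
           tmp_csv)

# skip = 3 drops the two junk lines plus the original header row,
# col_names supplies new column names, and the compact col_types string
# sets the types: c = character, D = date, i = integer
plot_tbl <- read_csv(tmp_csv,
                     skip = 3,
                     col_names = c("plot_num", "date", "species", "count"),
                     col_types = "cDci")

plot_tbl
```

Setting the column types explicitly avoids surprises when a column of IDs or codes would otherwise be guessed as numeric.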

Importing Vector GIS Data with `sf`

To import GIS data with sf, you have to specify a source and layer.

The Source can be a:

folder: "./gis_data"

geodatabase (which is really a folder): "./gis_data/yose_roads.gdb"

file: "trail_heads.kml", "cell_towers.geojson"

database connection string: "PG:dbname=postgis"

In sf functions, the argument where you provide the source is often named dsn (data source name)

The Layer can be a:

Shapefile name (minus the shp extension)

a named layer in the database / file

Import Functions

The two main functions for importing vector data are:

st_layers(source) - returns the names of available layers in a source

st_read(source, layer) - import into R

Most of the functions in the sf package start with st_, which stands for ‘spatial type’ and mirrors the prefix of similar functions in PostGIS.

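To view the metadata of a layer before bringing it into R, use st_layers(source), which reports each layer's geometry type, feature count, and field count. (rgdal::ogrInfo() was used for this in the past, but rgdal was retired in 2023 and is no longer on CRAN.)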
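
As a rough illustration of the source and layer idea, the sketch below uses the folder of sample Shapefiles that is installed with the sf package as the source, so it does not depend on any workshop data. The object names are arbitrary.

```r
library(sf)

# Source: the folder of sample Shapefiles installed with the sf package
shp_dir <- system.file("shape", package = "sf")

# List the layers available in the source
lyrs <- st_layers(shp_dir)
lyrs$name

# Import one layer by name (the Shapefile name minus the .shp extension)
nc_counties <- st_read(dsn = shp_dir, layer = "nc", quiet = TRUE)

# The result is a data frame with a geometry column
class(nc_counties)
```

Calling st_layers() first is a cheap way to check layer names and sizes before importing a large dataset.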

Importing Shapefiles

Shapefile format

Old format dating back to the early 90s
Originally from ESRI
Well-supported by nearly all GIS software
Actually a set of files with extensions \*.shp, \*.shx, \*.dbf (all three required), plus optional ones such as \*.prj
Often shared as a zip file
Column names in attribute table limited to 10 characters
Better options available today

Importing a Shapefile

st_read(dsn, layer)

dsn - directory, or shp filename
layer - shp filename (minus .shp), or omitted

Example:

library(sf)
yose_bnd_ll <- st_read(dsn="./data", layer="yose_boundary")

## This would also work:
## yose_bnd_ll <- st_read(dsn="./data/yose_boundary.shp")

View contents:

yose_bnd_ll

## Simple feature collection with 1 feature and 11 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -119.8864 ymin: 37.4947 xmax: -119.1964 ymax: 38.18515
## Geodetic CRS:  North_American_Datum_1983
##   UNIT_CODE
## 1      YOSE
##                                                                             GIS_NOTES
## 1 Lands - http://landsnet.nps.gov/tractsnet/documents/YOSE/METADATA/yose_metadata.xml
##                UNIT_NAME  DATE_EDIT STATE REGION GNIS_ID     UNIT_TYPE
## 1 Yosemite National Park 2016-01-27    CA     PW  255923 National Park
##   CREATED_BY                                               METADATA PARKNAME
## 1      Lands http://nrdata.nps.gov/programs/Lands/YOSE_METADATA.xml Yosemite
##                         geometry
## 1 POLYGON ((-119.8456 37.8327...

Plot:

plot(yose_bnd_ll$geometry, axes=TRUE)

Joining Tables

Join operations

join data frames on a column: left_join(), right_join(), inner_join()
stack data frames: bind_rows()
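
Stacking is the simpler of the two operations. A small sketch with made-up precinct tables:

```r
library(dplyr)

# Two made-up tables with the same columns
precincts_a <- data.frame(precinct = c("A1", "A2"), ballots = c(120, 95))
precincts_b <- data.frame(precinct = "B1", ballots = 210)

# Stack them into a single data frame (rows of b follow rows of a)
all_precincts <- bind_rows(precincts_a, precincts_b)
all_precincts
```

bind_rows() matches columns by name, so the column order in the two tables does not need to be the same.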

Join tables on a common column

To join two data frames based on a common column, you can use:

left_join(x, y, by)

where x and y are data frames, and by is the name of a column they have in common.

If there is only one column in common, and if it has the same name in both data frames, you can omit the by argument.

If the common column is named differently in the two data frames, you can deal with that by passing a named vector as the by argument. See below.

To illustrate a table join, we’ll first create a data frame with some fake data about the genetics of different iris species:

# Create a data frame with additional info about the three IRIS species
iris_genetics <- data.frame(Species=c("setosa", "versicolor", "virginica"),
                          num_genes = c(42000, 41000, 43000),
                          prp_alles_recessive = c(0.8, 0.76, 0.65))

iris_genetics

##      Species num_genes prp_alles_recessive
## 1     setosa     42000                0.80
## 2 versicolor     41000                0.76
## 3  virginica     43000                0.65

We can join these additional columns to the iris data frame with left_join():

library(dplyr)
iris |> 
  left_join(iris_genetics, by = "Species") |> 
  slice(1:10)

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species num_genes
## 1           5.1         3.5          1.4         0.2  setosa     42000
## 2           4.9         3.0          1.4         0.2  setosa     42000
## 3           4.7         3.2          1.3         0.2  setosa     42000
## 4           4.6         3.1          1.5         0.2  setosa     42000
## 5           5.0         3.6          1.4         0.2  setosa     42000
## 6           5.4         3.9          1.7         0.4  setosa     42000
## 7           4.6         3.4          1.4         0.3  setosa     42000
## 8           5.0         3.4          1.5         0.2  setosa     42000
## 9           4.4         2.9          1.4         0.2  setosa     42000
## 10          4.9         3.1          1.5         0.1  setosa     42000
##    prp_alles_recessive
## 1                  0.8
## 2                  0.8
## 3                  0.8
## 4                  0.8
## 5                  0.8
## 6                  0.8
## 7                  0.8
## 8                  0.8
## 9                  0.8
## 10                 0.8

If you need to join tables on multiple columns, add additional column names to the by argument.
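
For instance, a join on two columns might look like the sketch below. The tables and numbers are invented for illustration; a precinct number only identifies a row in combination with its county.

```r
library(dplyr)

# Made-up tables: precinct numbers repeat across counties
turnout <- data.frame(county = c("Camden", "Camden", "Essex"),
                      precinct = c(1, 2, 1),
                      ballots = c(310, 275, 402))

registered <- data.frame(county = c("Camden", "Camden", "Essex"),
                         precinct = c(1, 2, 1),
                         reg_voters = c(800, 650, 990))

# Rows match only when both county and precinct agree
turnout_reg <- left_join(turnout, registered, by = c("county", "precinct"))
turnout_reg
```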

Join columns must have compatible data types (e.g., both numeric or both character).

There are several variants of left_join(), the most common being right_join() and inner_join(). See help for details.
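
The difference between the variants is which unmatched rows are kept. A small sketch with two made-up tables that share only some id values:

```r
library(dplyr)

x <- data.frame(id = c(1, 2, 3), x_val = c("a", "b", "c"))
y <- data.frame(id = c(2, 3, 4), y_val = c("B", "C", "D"))

# Keeps every row of x; y_val is NA where y has no match (id 1)
lj <- left_join(x, y, by = "id")

# Keeps every row of y; x_val is NA where x has no match (id 4)
rj <- right_join(x, y, by = "id")

# Keeps only the rows found in both tables (ids 2 and 3)
ij <- inner_join(x, y, by = "id")
```

Comparing nrow() before and after a join is a quick check that no rows were dropped or duplicated unexpectedly.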

Joining Tables When the Column Name is Different

If the join column is named differently in the two tables, you can pass a named character vector as the by argument. A named vector is a vector whose elements have been assigned names. You can construct a named vector with c().

For example if the join column was named ‘SpeciesName’ in x, and just ‘Species’ in y, your expression would be:

left_join(x, y, by = c("SpeciesName" = "Species"))
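
A runnable version of this pattern, as a sketch: the iris data frame is renamed so that its join column is called SpeciesName, and species_info is a made-up lookup table.

```r
library(dplyr)

# x: iris with the join column renamed to SpeciesName
iris_renamed <- rename(iris, SpeciesName = Species)

# y: a made-up lookup table keyed on Species
species_info <- data.frame(Species = c("setosa", "versicolor", "virginica"),
                           num_genes = c(42000, 41000, 43000))

# The name on the left of = is the column in x, the value on the right is the column in y
iris_joined <- left_join(iris_renamed, species_info,
                         by = c("SpeciesName" = "Species"))

names(iris_joined)
```

The joined table keeps the column name from x (SpeciesName here).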

Reshaping Data

Reshaping data includes:

turning rows into columns (aka pivot tables, cross tab query)
turning columns into rows

The go-to Tidyverse package for reshaping data frames is tidyr

Pivot Functions

pivot_longer()

pivot_wider()

More info and examples in the tidyr Pivoting Vignette
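
A minimal sketch of both functions, using a made-up wide table with one column of ballots per party:

```r
library(tidyr)

# Wide format: one row per precinct, one column per party (made-up numbers)
turnout_wide <- data.frame(precinct = c("P1", "P2"),
                           dem_votes = c(120, 80),
                           rep_votes = c(90, 110))

# Wide to long: one row per precinct and party
turnout_long <- pivot_longer(turnout_wide,
                             cols = c(dem_votes, rep_votes),
                             names_to = "party",
                             values_to = "ballots")

# Long back to wide
turnout_back <- pivot_wider(turnout_long,
                            names_from = party,
                            values_from = ballots)
```

The long format is usually what plotting functions expect when you want one panel or color per category.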

Exercise 2: Import and Map Voter Turnout Data

In this exercise, we will:

import the 2020 voting tabulation districts (VTDs) for Camden County NJ from a Shapefile
import a CSV with voter turnout data from the 2020 primary election
join the tabular voter turnout data to the VTD polygons
save the data to disk (for use in other exercises)
map the voter turnout for the July 2020 primary
reshape the attribute table from a wide to long format
create facet maps (i.e., one map for each subset of the data)
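
The overall workflow can be sketched with stand-in data. This is not the exercise solution: the polygons are the North Carolina counties bundled with sf rather than the Camden County VTDs, and the turnout numbers are randomly generated. Reshaping the table before joining it to the polygons is one possible order of operations, chosen here for simplicity.

```r
library(sf)
library(dplyr)
library(tidyr)
library(ggplot2)

# Stand-in polygons: North Carolina counties bundled with sf
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE) |>
  select(FIPS, NAME)

# Stand-in turnout table with invented numbers, one row per county
set.seed(1)
turnout_tbl <- data.frame(FIPS = nc$FIPS,
                          turnout_primary = runif(nrow(nc), 0.1, 0.4),
                          turnout_general = runif(nrow(nc), 0.5, 0.8))

# Join the table to the polygons (sf object first, so the geometry is kept)
nc_turnout <- left_join(nc, turnout_tbl, by = "FIPS")

# Save to disk for later use (a GeoPackage in the temp folder)
gpkg_path <- tempfile(fileext = ".gpkg")
st_write(nc_turnout, gpkg_path, quiet = TRUE)

# Reshape the turnout columns from wide to long, then join to the polygons
turnout_long <- pivot_longer(turnout_tbl,
                             cols = starts_with("turnout_"),
                             names_to = "election",
                             values_to = "turnout")
nc_long <- left_join(nc, turnout_long, by = "FIPS")

# Facet map: one panel per election
turnout_map <- ggplot(nc_long) +
  geom_sf(aes(fill = turnout)) +
  facet_wrap(~election)
turnout_map
```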

https://posit.cloud/content/8521414
