Geospatial Data

Overview

library(sf)
library(tidycensus)
library(tmap)

The Data

Today we’ll be looking at the Massachusetts Historical Commission Historic Inventory data. This data set includes several locations considered “historic assets” of the commonwealth. We’re going to see how many of them are close to Smith college, and maybe spark ideas for a few weekend trips.

The zip file containing the data can be downloaded from here under the downloads section. You want the “Download Geodatabase (17 MB)” version. Once you have it downloaded, place it somewhere easy to access. Do not unzip the file, we can load it into R as is. You will use read_sf() from the sf package to load in the data.

# Read in the MA Historic Inventory
ma_hi = read_sf("Path/to/data/MHC_Inventory_GDB.zip")

We will also want the boundaries of Massachusetts to serve as a background, so let’s use tidycensus to grab that. You will need to register for a census API key, and set it using census_api_key(). Once that is done, we can use get_acs() to perform an API call for census data, including geospatial data.

# Set api key if you do not have one.
# I do so I will skip this
# census_api_key("YOUR KEY HERE", install = TRUE)

# get county boundaries in MA
ma_counties <- get_acs(geography = "county", 
              variables = c(total_pop = "B01003_001"), 
              state = "MA", 
              geometry = TRUE,
              year = 2020)

The Census has tons of useful geospatial boundaries. I’ll include the most common ones here for reference.

GeographyDefinitionAvailable byAvailable in
“us”United Statesget_acs(), get_decennial()
“region”Census regionget_acs(), get_decennial()
“division”Census divisionget_acs(), get_decennial()
“state”State or equivalentstateget_acs(), get_decennial()
“county”County or equivalentstate, countyget_acs(), get_decennial()
“county subdivision”County subdivisionstate, countyget_acs(), get_decennial()
“tract”Census tractstate, countyget_acs(), get_decennial()
“block group”Census block groupstate, countyget_acs(), get_decennial()
“block”Census blockstate, countyget_decennial()
“place”Census-designated placestateget_acs(), get_decennial()
“metropolitan statistical area/micropolitan statistical area”Core-based statistical areastateget_acs(), get_decennial()
“combined statistical area”Combined statistical areastateget_acs(), get_decennial()
“urban area”Census-defined urbanized areasget_acs(), get_decennial()
“congressional district”Congressional district for the year-appropriate Congressstateget_acs(), get_decennial()
“school district (elementary)”Elementary school districtstateget_acs(), get_decennial()
“school district (secondary)”Secondary school districtstateget_acs(), get_decennial()
“school district (unified)”Unified school districtstateget_acs(), get_decennial()
“zip code tabulation area” OR “zcta”Zip code tabulation areaget_acs(), get_decennial()
“state legislative district (upper chamber)”State senate districtsstateget_acs(), get_decennial()
“state legislative district (lower chamber)”State house districtsstateget_acs(), get_decennial()
“voting district”Voting districts (2020 only)stateget_decennial()

Checking the Coordinate Reference System (CRS)

One of the first steps when working on any spatial data is to check the Coordinate Reference System (CRS). Good documentation will always tell you what the CRS of your data is. The Massachusetts Historical Commission Historic Inventory does not state the CRS on its website. However, more advanced spatial data types include this information in the metadata of the file itself. We can check the CRS using the st_crs() function.

# check the crs of the ma_hi data
st_crs(ma_hi)
Coordinate Reference System:
  User input: NAD83 / Massachusetts Mainland 
  wkt:
PROJCRS["NAD83 / Massachusetts Mainland",
    BASEGEOGCRS["NAD83",
        DATUM["North American Datum 1983",
            ELLIPSOID["GRS 1980",6378137,298.257222101,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4269]],
    CONVERSION["SPCS83 Massachusetts Mainland zone (meters)",
        METHOD["Lambert Conic Conformal (2SP)",
            ID["EPSG",9802]],
        PARAMETER["Latitude of false origin",41,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8821]],
        PARAMETER["Longitude of false origin",-71.5,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8822]],
        PARAMETER["Latitude of 1st standard parallel",42.6833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8823]],
        PARAMETER["Latitude of 2nd standard parallel",41.7166666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8824]],
        PARAMETER["Easting at false origin",200000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8826]],
        PARAMETER["Northing at false origin",750000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8827]]],
    CS[Cartesian,2],
        AXIS["easting (X)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["northing (Y)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Engineering survey, topographic mapping."],
        AREA["United States (USA) - Massachusetts onshore - counties of Barnstable; Berkshire; Bristol; Essex; Franklin; Hampden; Hampshire; Middlesex; Norfolk; Plymouth; Suffolk; Worcester."],
        BBOX[41.46,-73.5,42.89,-69.86]],
    ID["EPSG",26986]]
# check the crs of census data
st_crs(ma_counties)
Coordinate Reference System:
  User input: NAD83 
  wkt:
GEOGCRS["NAD83",
    DATUM["North American Datum 1983",
        ELLIPSOID["GRS 1980",6378137,298.257222101,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["latitude",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["longitude",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4269]]

Right now our references systems do not match. We need to make sure they do, otherwise any plots we make or analyses we do will be misaligned. We can change once CRS to another using the st_transform() function. Here I’ll change the ma_counties CRS to match the historic sites data.

# change the crs to match
ma_counties_transform = st_transform(x = ma_counties, st_crs(ma_hi))

# check the crs matches
st_crs(ma_hi) == st_crs(ma_counties_transform)
[1] TRUE

Creating Maps

Simple Maps

To start out mapping journey, let’s just map our historic sites. We can do that using the tmap package. It works in a similar way to ggplot, in that it builds layers on top of each other to create a final visualization. Here I say look at my ma_counties data, and draw polygons from that data.

# plot the basic counties
tm_shape(ma_counties_transform) +
  tm_polygons()

If I wanted to shade the counties by population (already included in my county data from our census api call), we just need to tell the polygons to be shaded by some variable. Note that is only works in this case because I have only one variable in a long format dataframe. If I had more, I would need to transform to wide first.

# plot with polygons shaded by population
tm_shape(ma_counties_transform) +
  tm_polygons(col = "estimate")

We can change how things are shaded with a few arguments. For example, below I change the color palette, and make it so the scale is continuous rather the categorical.

# plot with polygons shaded by population
tm_shape(ma_counties_transform) +
  tm_polygons(col = "estimate", palette = "viridis", style = "cont")

We can use tm_layout() and tm_credits() to add all of our important meta information. I also add a “title” to the polygons to change the legend title.

# plot with metadata
tm_shape(ma_counties_transform) +
  tm_polygons(col = "estimate", palette = "viridis", style = "cont", title = "Population") +
  tm_layout(title = "Population by county in Massachusetts",
            frame = FALSE) +
  tm_credits("American Community Survey 5-year estiamtes 2020", position = c("center", "bottom"))

Interactive Maps

The cool thing about tmap is that is it super easy to turn your static maps interactive. All you need to do is call tmap_mode(), and set the mode to “view” rather than plot. From then on, any maps you plot will be interactive! You can switch back by calling tmap_mode() again and setting it to “plot.”

The polygons in the interactive map show here are slightly misaligned to the base map. This is an issue with this website, and should not be present when making your maps in R.

# change modes
tmap_mode("view")

# plot the same map as above
tm_shape(ma_counties_transform) +
  tm_polygons(col = "estimate", palette = "viridis", style = "cont", id = "NAME", title = "Population")
# change back
tmap_mode("plot")

Geospatial Aggregation and Subsetting

A geospatial join works a bit differently than the other joins we have done. Rather than trying to match data by some value, we do some kind of spatial comparison. For example, are certain points within a specified area? Do two areas touch? Are points within a certain distance from each other? These are the sorts of questions you are merging and splitting by in a spatial context.

To start with, let’s quickly plot out historic assets on top of the counties.

# plot historic sites over the state
tm_shape(ma_counties_transform) +
  tm_polygons() +
  tm_shape(ma_hi) +
  tm_polygons(col = "blue")

For the sake of simplicity, let’s subset the data to only look at Hampshire county. We can do that using the following. First I subset Hampshire country out of our full state data. Then, I say give me all the historic sites that are within Hampshire county. I do this using the st_filter() function, and for join type I use st_within. There are several other comparators, which you can see here, and are explained in some detail here.

# get just Hampshire county
hampshire_county = ma_counties_transform[ma_counties_transform$NAME == "Hampshire County, Massachusetts", ]

# get only the locations completely within Hampshire county (so if anything is on the boundary, it will be dropped)
hampshire_sites = st_filter(ma_hi, hampshire_county, join = st_within, left = FALSE)

# plot our results
tm_shape(hampshire_county) +
  tm_polygons() +
  tm_shape(hampshire_sites) +
  tm_polygons(col = "blue")

We’re close, but we still have location that are only partially inside the county. This isn’t necessarily bad, but let’s trim off all the extra bits for now. We can do that using st_intersection(), which will return only the area where both our inputs intersect.

# Trim the historic sites so that we only have the areas within the county
trimmed_sites = st_intersection(hampshire_county, hampshire_sites)

# plot the new trimmed sites.
tm_shape(hampshire_county) +
  tm_polygons() +
  tm_shape(trimmed_sites) +
  tm_polygons(col = "blue")

Geospatial Computations

Now that we have our data in a state that is easier to work with, let’s ask some questions of it. First, let’s see how many sites are within 5 miles of Smith college. To do this, we’ll need the spatial coordinates of the college. You can grab these off of Google maps by right clicking on a location. Smith collage is located at longitude/latitude 42.316643513274755, -72.64027428068538. Once we have that, we can ask “how many of our historic sites are within X distance of Y?”

First we need to turn the longitude and latitude of Smith College into a spatial point. I do that below, and make a quick plot to check if it all worked.

# turn the location of smith college into a spatial point
# fist make a data frame with the data
smith_college = data.frame("name" = "smith college", lon =  42.316643513274755, lat = -72.64027428068538)
# turn it into a spatial data frame with the right CRS (what is used by google maps)
smith_college =  st_as_sf(smith_college, coords = c("lat", "lon"), crs = "EPSG:4326")
# convert CRS to match our data
smith_college = st_transform(smith_college, crs = st_crs(hampshire_county))

# plot the college
tm_shape(hampshire_county) +
  tm_polygons() +
  tm_shape(smith_college) +
  tm_dots(col = "gold", size = 1)

Now that that is done, we can start making some comparisons. We first need to know what units are geospatial objects use. We can do this using the st_crs() function again. I’ll subset the results to just get the units used. We see that this object uses meters.

# get the units of length used in this object
st_crs(trimmed_sites, parameters = TRUE)$ud_unit
1 [m]

Given that, if we want to find what sites are within one mile, we would ask what things are within a distance of 1,609.344 meters. I’ll multiply that by 5 to get those within 5 miles of the college.

# find all the sites within a 5 miles of smith college
close_sites = st_is_within_distance(smith_college, trimmed_sites, dist = 1609.344 * 5)
smith_5_mile_sites = trimmed_sites[unlist(close_sites), ]

We can then make a pretty cool map showing the nearby points of interest! You can click on the areas to get info about them.

The polygons in the interactive map show here are slightly misaligned to the base map. This is an issue with this website, and should not be present when making your maps in R.

# filter sites for easier plotting later.
smith_5_mile_sites = smith_5_mile_sites[, c("HISTORIC_N", "TOWN_NAME", "USE_TYPE", "geometry")]

# set mode to interactive
tmap_mode("view")

# plot those sites
tm_shape(smith_5_mile_sites) +
  tm_polygons(col = "green") +
  tm_shape(smith_college) +
  tm_dots(col = "gold", size = 1)

Conclusion

Spatial data analysis is a great tool to have in your toolbox. It unlocks the ability to ask a lot of relevant questions about your world. If you would like to learn more, I encourage you to look at the sf package home page.