Visualizing data is very important for understanding that data. All sorts of otherwise unknown patterns can slip by if we don’t take the time to look at our data. For example:
While the Datasaurus is fun, it is also a cautionary tale. Along the top of the plot there are various numerical representations of the data, all of which look completely normal! Such oddities can easily slip past us without proper visualization.
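The same lesson can be reproduced with a dataset that ships with R. The sketch below uses Anscombe's quartet (the built-in anscombe data frame) as a stand-in for the Datasaurus; this swap is an assumption made for convenience, not the example from the figure. It computes a few summary statistics for four small datasets and then plots them.

```r
# anscombe is part of base R's datasets package: columns x1..x4 and y1..y4.
data(anscombe)

# Compute the same summaries for each of the four x/y pairs.
summary_stats <- data.frame(
  set = 1:4,
  mean_x = sapply(1:4, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]])),
  cor_xy = sapply(1:4, function(i) {
    cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
  })
)
print(round(summary_stats, 2))

# Plot the four sets in a 2x2 grid to compare their shapes.
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(old_par)
```

The summary table looks nearly identical across the four sets, while the plots do not.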
Visualization also serves another crucial purpose: communicating our findings to other people. Dealing with and understanding data is an acquired skill, one you’re already more familiar with than the vast majority of people. Visualizations often allow you to quickly communicate your findings to people who are not as well acquainted with data.
Today we will be learning some more advanced plotting tools. While we will often rely on the 5 canonical plots we learned before, a lot can be done to spruce them up. We will primarily be using the ggplot2 package for this task. While it is very powerful, it takes more effort to make even a simple plot with ggplot2. That effort is often worth it, however, if the intent is to share your visualization.
I highly recommend you open up the ggplot2 cheatsheet while you work.
Load in the data for today by running the following:
nbi_hampshire = read.csv("https://raw.githubusercontent.com/Intro-to-Data-Science-Template/intro_to_data_science_reader/main/content/class_worksheets/12_adv_plot/data/nbi_hampshire.csv")
Today we will be using data from the 2022 National Bridge Inventory Dataset. Specifically, we will be looking at vehicle bridges in Hampshire county (where we are). We’re going to look at what kinds of variables might contribute to poor bridge conditions, where there are poor bridge conditions, and which entities are responsible for maintaining them. The data documentation for this dataset is quite thick, so I will provide you with a data dictionary for today.
Today we will be using a package called esquisse to help us make our plots. It provides an interactive interface to put together the basic components of our plots. It is not perfect, and cannot be used to create a fully realized visualization, but it can often do the first 50% for you pretty quickly. Install it using the following:
install.packages("esquisse")
You will need to fully close RStudio and restart it after installing esquisse. This is so the addin functionality can work.
Once you have restarted RStudio and re-loaded the data, we can start using esquisse to build draft plots. Start the UI by going to the Addins menu at the top of RStudio and clicking on the ‘ggplot2’ builder option under esquisse.
esquisse will load for a second, and then ask you to select a data source. Pick our nbi_hampshire dataframe and click Import Data.
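If the Addins menu is not cooperating, the builder can also be started from the console. This is a minimal sketch, assuming esquisse is installed and nbi_hampshire has already been loaded as shown earlier; the can_launch name is just a helper introduced for this example.

```r
# Only try to open the builder in an interactive session with esquisse installed.
can_launch <- interactive() && requireNamespace("esquisse", quietly = TRUE)

if (can_launch) {
  # esquisser() opens the plot builder with the given data frame pre-selected.
  esquisse::esquisser(nbi_hampshire)
}
```

Passing the dataframe directly should skip the data source prompt.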
After you have imported your data, you will be taken to the plot builder. I have created a diagram of the different elements below. Along the top are the variables from your dataframe. You can click and drag the variables into the various element areas below. On the left of this section will be a display of what plot esquisse thinks may work well given your variables and elements, but you can click on it to manually change it. In the center is the plot preview, which will automatically update as you make changes. At the bottom is the options bar, with several other menus you can go through to adjust your plot. Importantly, in the far right option menu you can copy the code which will generate the plot you are looking at! You will want to copy that code into your document so you can remake this plot later, and further adjust it.
Open up the plot builder, and import the nbi_hampshire dataframe. For our first variable, click and drag YEAR_BUILT_027 into the “X” element box. This will put YEAR_BUILT_027 on our X axis. Next, click and drag STRUCTURAL_EVAL_067 into the “Y” element box to place it on the Y axis. In a moment the plot preview area should update, creating a scatter plot.
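For reference, the drag-and-drop steps above correspond roughly to the ggplot2 code sketched here. To keep the example runnable on its own, it uses a small made-up data frame called toy_bridges that borrows the real column names; the values are invented for illustration and are not from the National Bridge Inventory.

```r
library(ggplot2)

# Hypothetical stand-in for nbi_hampshire with invented values.
toy_bridges <- data.frame(
  YEAR_BUILT_027 = c(1932, 1954, 1968, 1985, 2001, 2015),
  STRUCTURAL_EVAL_067 = c(4, 5, 6, 6, 7, 8)
)

# "X" and "Y" element boxes become the x and y aesthetics;
# geom_point() draws the scatter plot.
scatter_plot <- ggplot(toy_bridges) +
  aes(x = YEAR_BUILT_027, y = STRUCTURAL_EVAL_067) +
  geom_point()

scatter_plot
```

Swapping toy_bridges for nbi_hampshire should give a plot close to the esquisse preview.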
You can continue adding variables to elements to further develop the plot.
Click and drag the ROUTE_PREFIX_005B_L variable into the “color” element. How does this change the plot?
Adds route type as color to the dots in the scatterplot.
Once you have added the new element, click on the plot type window next to the variables. In the menu that pops up, select “Jitter.” A jittered scatter plot adds a little random noise to the dot locations so that multiple dots that are in the same spot can be seen.
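To see what jittering does in code, the sketch below draws the same made-up data twice: once with geom_point() and once with geom_jitter(). The toy_bridges data frame and its route labels are invented for illustration, and the width and height values are arbitrary choices for this toy data rather than recommended settings.

```r
library(ggplot2)

# Hypothetical data with several bridges stacked on the same (year, score) spot.
toy_bridges <- data.frame(
  YEAR_BUILT_027 = c(1950, 1950, 1950, 1975, 1975, 2000, 2000, 2000),
  STRUCTURAL_EVAL_067 = c(5, 5, 5, 6, 6, 7, 7, 7),
  ROUTE_PREFIX_005B_L = c("State", "Town", "Town", "State",
                          "Town", "State", "State", "Town")
)

# Shared data and aesthetic mappings for both versions of the plot.
base_plot <- ggplot(toy_bridges) +
  aes(
    x = YEAR_BUILT_027,
    y = STRUCTURAL_EVAL_067,
    color = ROUTE_PREFIX_005B_L
  )

# Without jitter, overlapping bridges are drawn on top of each other.
plain_plot <- base_plot + geom_point(size = 1.5)

# With jitter, each point is nudged by a small random amount in x and y.
jitter_plot <- base_plot + geom_jitter(size = 1.5, width = 2, height = 0.2)

plain_plot
jitter_plot
```

The jittered version moves the points slightly each time it is drawn, so treat the exact positions as approximate.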
Once that is done, look toward the bottom of the plot builder at the options menus. In the “Labels & Title” menu, add a title, labels for the X and Y axes, and a label for the colors.
You can add further refinements if you would like. Once you are done, go to the last option menu that says “Code.” Open that menu to see the ggplot2 code creating your plot! Copy the code and paste it into a script to continue working on it.
esquisse is a helpful shortcut in getting started, but you will almost always need to do some fine-tuning of the resulting code. The first step, however, is understanding all the component parts.
ggplot builds plots in layers. It combines those layers using a syntax that is unique to ggplot (not used for anything else): the + sign joins one layer to the next. The most common layers are as follows, and you should be able to see them in your esquisse output.
ggplot(<DATA>) +

aes(x = <VARIABLE>, y = <VARIABLE>, color = <VARIABLE>) +
aes() or “aesthetic mappings” tell ggplot what variables belong where. These are the element boxes we see in esquisse. You can define the aes alone, in which case it will use the same variables for all layers of the plot. Alternatively, you can define the aes for a specific geom_XXXX() layer, as we will see next.

geom_jitter(size = <VALUE>) +
geom_XXXX() layers. One geom defines one type of plot to layer on. For example, here we have a geom_jitter(), which adds a jittered scatter plot layer. We could also use a geom_bar() for a bar plot, a geom_histogram() for a histogram, etc. We could define aes inside a geom if we wanted, instead of outside like we did before, in which case it would only apply to that layer. We could thus theoretically layer on multiple datasets in one plot.

labs() +
The labs() function lets us add labels, titles, and captions to our plot. You will usually at least want to add the title = <CHARACTER>, x = <CHARACTER>, and y = <CHARACTER> arguments.

theme_minimal()
theme_minimal() is a good choice of theme, as it cuts away everything that isn’t useful.

All of these elements build up to something that will look about like this:
ggplot(nbi_hampshire) +
  aes(
    x = YEAR_BUILT_027,
    y = STRUCTURAL_EVAL_067,
    color = ROUTE_PREFIX_005B_L
  ) +
  geom_jitter(size = 1.5) +
  labs(title = "Vehicle Bridges in Hampshire County",
       x = "Year",
       y = "Structural Evaluation Score",
       color = "Type") +
  theme_minimal()
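The idea of giving a geom its own aes and data, mentioned in the layer descriptions, can be sketched as follows. Both data frames here (toy_bridges and toy_targets) and all of their values are made up for illustration; only the two bridge column names are borrowed from the real dataset.

```r
library(ggplot2)

# Hypothetical bridge data with invented values.
toy_bridges <- data.frame(
  YEAR_BUILT_027 = c(1935, 1950, 1962, 1978, 1991, 2005),
  STRUCTURAL_EVAL_067 = c(4, 5, 5, 6, 7, 8)
)

# A second, unrelated hypothetical dataset: a flat reference score over time.
toy_targets <- data.frame(
  year = c(1935, 2005),
  target = c(6, 6)
)

# A plot can be stored in a variable and extended later with +.
base_plot <- ggplot(toy_bridges) +
  aes(x = YEAR_BUILT_027, y = STRUCTURAL_EVAL_067)

layered_plot <- base_plot +
  # This layer inherits the data and aes defined above.
  geom_point() +
  # This layer supplies its own data and its own aes, overriding x and y.
  geom_line(
    data = toy_targets,
    mapping = aes(x = year, y = target),
    linetype = "dashed"
  ) +
  labs(x = "Year", y = "Structural Evaluation Score") +
  theme_minimal()

layered_plot
```

Because the second layer names its own data, the two layers can draw from different dataframes while sharing the same axes.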
Create a box plot using ggplot/esquisse which shows STRUCTURAL_EVAL_067 by MAINTENANCE_021_L.
ggplot(nbi_hampshire) +
  aes(x = MAINTENANCE_021_L, y = STRUCTURAL_EVAL_067) +
  geom_boxplot(fill = "#112446") +
  theme_minimal()