Intro to R for Social Data Science

Visualization 1

Merlin Schaeffer
Department of Sociology

2021-09-17

1 / 32

# Add packages to library
library(tidyverse) # Add the tidyverse package to my current library.
library(haven) # Read and handle SPSS, Stata & SAS data (installed with the tidyverse, but must be loaded separately).
library(essurvey) # Add ESS API package to library.
library(ggplot2) # Allows us to create nice figures (already attached by tidyverse; listed for clarity).
# Import the ESS round 9 data via the API
ESS <- import_rounds(rounds = 9, ess_email = "YOUR-EMAIL", format = "spss")
ESS <- transmute(ESS, # Recode several variables & keep only the recoded ones (i.e., transmute vs mutate).
idno = zap_labels(idno),
# Make the following variables factors:
cntry = as_factor(cntry),
gndr = as_factor(gndr),
facntr = as_factor(facntr),
mocntr = as_factor(mocntr),
# Make the following variables numeric:
imbgeco = max(imbgeco, na.rm = TRUE) - zap_labels(imbgeco), # Also turn scale around.
imueclt = max(imueclt, na.rm = TRUE) - zap_labels(imueclt), # Also turn scale around.
imwbcnt = max(imwbcnt, na.rm = TRUE) - zap_labels(imwbcnt), # Also turn scale around.
agea = zap_labels(agea),
pspwght = zap_labels(pspwght),
eduyrs = case_when(
eduyrs > 21 ~ 21, # Recode to max 21 years of edu.
eduyrs < 9 ~ 9, # Recode to min 9 years of edu.
TRUE ~ zap_labels(eduyrs) # Make it numeric.
),
)
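
The two recoding tricks used above can be tried on small made-up vectors. This is a minimal sketch in base R, not the ESS data: `answers` and `years` are hypothetical, and `pmin()`/`pmax()` are used here as an assumed base R alternative to the `case_when()` call. Note that reversing with `max()` only yields the intended scale if the highest possible answer actually occurs in the data.

```r
# Toy answers on a 0-10 scale; the values are made up for illustration.
answers <- c(0, 3, 10, NA)

# Turn the scale around: the highest observed value becomes 0 and vice versa.
reversed <- max(answers, na.rm = TRUE) - answers
reversed

# Toy years of education, restricted to the range from 9 to 21.
years <- c(5, 12, 30)
clamped <- pmin(pmax(years, 9), 21)
clamped
```

Missing values stay missing after the reversal, which is why the later `drop_na()` step still matters.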
4 / 32
# Case selection.
ESS <- dplyr::filter(ESS,
# Only respondents whose parents were born in country of interview.
facntr == "Yes" & mocntr == "Yes" &
# Only respondents from direct neighbors of Denmark:
(cntry == "Denmark" | cntry == "Germany" | cntry == "Sweden" | cntry == "Norway")
)
# Casewise deletion of missing values
(ESS <- drop_na(ESS))
# # A tibble: 5,354 × 11
# idno cntry gndr facntr mocntr imbgeco imueclt imwbcnt agea pspwght eduyrs
# <dbl> <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 10 Germany Female Yes Yes 0 5 5 65 0.854 11
# 2 64 Germany Female Yes Yes 1 2 2 74 0.760 20
# 3 65 Germany Male Yes Yes 6 6 7 64 1.08 12
# 4 91 Germany Female Yes Yes 2 2 2 54 1.27 14
# 5 150 Germany Female Yes Yes 0 0 4 71 0.942 12
# 6 212 Germany Male Yes Yes 2 2 2 41 1.42 14
# 7 255 Germany Male Yes Yes 3 3 3 62 1.23 16
# 8 270 Germany Male Yes Yes 3 6 5 65 0.978 14
# 9 304 Germany Female Yes Yes 1 1 5 47 0.984 13
# 10 311 Germany Male Yes Yes 0 0 5 67 0.535 18
# # … with 5,344 more rows
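
The case selection and casewise deletion can be checked on a tiny example. The sketch below uses a made-up data frame `toy` (not the ESS data) and writes the country condition with `%in%`, which should be equivalent to the chain of `|` comparisons above.

```r
library(dplyr)
library(tidyr)

# Made-up respondents; only the column names mirror the ESS variables.
toy <- data.frame(
  cntry  = factor(c("Denmark", "France", "Sweden", "Norway")),
  facntr = factor(c("Yes", "Yes", "No", "Yes")),
  mocntr = factor(c("Yes", "Yes", "Yes", "Yes")),
  agea   = c(30, 40, 50, NA)
)

# Keep respondents with native-born parents from the four countries.
kept <- dplyr::filter(
  toy,
  facntr == "Yes" & mocntr == "Yes" &
    cntry %in% c("Denmark", "Germany", "Sweden", "Norway")
)

# Casewise deletion: drop every row with at least one missing value.
complete <- drop_na(kept)
complete
```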
5 / 32

Why visualize? A simulated example

  • We are better at detecting patterns in figures than in tables of numbers.
  • You will reach wider audiences with figures than with tables.
  • You will understand your own data faster while exploring it.
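# Multilevel mixed effects model (sim_data is simulated for this example).
library(lme4) # Provides lmer().
library(stargazer) # Formats the regression table.
lmer(data = sim_data,
formula = xeno ~ education +
(1 + education | Country)) %>%
stargazer(type = "text", style = "asr")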
#
# ---------------------------------
# xeno
# ---------------------------------
# education -0.799***
# Constant 0.314
# N 500
# Log Likelihood -154.000
# AIC 320.000
# BIC 345.000
# ---------------------------------
# *p < .05; **p < .01; ***p < .001
6 / 32


Why ggplot2? Because of its grammar of graphics

Independently specify the building blocks of a figure and combine them to create just about any kind of figure you want; it's like Lego ;-).

7 / 32

The coordinate system

ggplot() # Create an empty coordinate system.

8 / 32

The coordinate system

ggplot(data = ESS) # Create an empty coordinate system for the ESS data.

9 / 32

Layers

ggplot(data = ESS) + # Add ...
geom_point(mapping = aes(y = imwbcnt, x = eduyrs)) # a "layer" of points (i.e., a scatter plot).

10 / 32

Layers

ggplot(data = ESS) +
geom_point(mapping = aes(y = imwbcnt, x = eduyrs))

11 / 32

The layered grammar of graphics

# A general template
ggplot(data = <DATA>) + # Create a coordinate system for <DATA>, and add "+"
<GEOM_FUNCTION>( # a layer of (geometric) information, which
mapping = aes(<MAPPINGS>), # maps our data to aesthetics, and
stat = <STAT>, # may depend on statistical transformations.
position = <POSITION> # Positioning may be adjusted.
) +
<COORDINATE_FUNCTION> + # Change the default coordinate system.
<FACET_FUNCTION> # Draw sub-plots by categorical variables.

Source: Wickham & Grolemund "R for Data Science"

ggplot2 contains many geom functions, which put layers of different types of geometric objects (e.g., points, bars, lines) over a coordinate system.

  • All geom functions take a mapping argument. It is paired with aes(), which stands for "aesthetics". Aesthetics are the visual properties of your plot.
  • The most important aesthetics of most graphs are the y-axis and the x-axis. Therefore, aes() usually receives x and y, which specify the variable to map to the x-axis and the variable to map to the y-axis.
  • But of course, aesthetics also include color, shape, size, and so on.
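
To see that aes() only records which variable goes to which visual property, one can inspect the mapping on its own. A minimal sketch: no data are needed, because aes() does not evaluate the variable names until the plot is built.

```r
library(ggplot2)

# Record a mapping; the variables are not looked up at this point.
mapping <- aes(y = imwbcnt, x = eduyrs, color = cntry)

# The names of the mapped aesthetics (ggplot2 stores "color" as "colour").
names(mapping)
```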
12 / 32

Aesthetics: The visual properties of your plot

If you want an aesthetic to depend on the values of a variable, you need to specify it within aes().

ggplot(data = ESS) +
geom_point(mapping = aes(y = imwbcnt, x = eduyrs, color = cntry)) # Color by country.

13 / 32
  • ggplot2 automatically assigns a unique level of the aesthetic (e.g., a distinct color, shape, or size) to each value of the variable.

  • It will also generate a legend.

Aesthetics

ggplot(data = ESS) +
geom_point(mapping = aes(y = imwbcnt, x = eduyrs, color = cntry,
size = pspwght)) # Size by post-stratification weight.

14 / 32

You can manually control the aesthetics, that is, which colors and which sizes are used. But that is fine-tuning. We want to explore our data right now.
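
For later reference, manual control works through scale functions. The following is a minimal sketch on a made-up data frame `toy`; the colors "red" and "blue" are arbitrary choices, and `layer_data()` is only used to show which colors ggplot2 ends up assigning.

```r
library(ggplot2)

# Made-up data with a two-level grouping variable.
toy <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(2, 1, 4, 3),
  group = c("A", "B", "A", "B")
)

p <- ggplot(data = toy) +
  geom_point(mapping = aes(x = x, y = y, color = group)) +
  # Pick the color for each value of group by hand.
  scale_color_manual(values = c(A = "red", B = "blue"))

# Colors that ggplot2 assigns to the plotted points.
point_colours <- as.character(layer_data(p)$colour)
point_colours
```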

Aesthetics

Because R is object-oriented, aesthetics behave differently depending on whether you give them a categorical or a continuous variable.

ggplot(data = ESS) +
geom_point(mapping = aes(y = imwbcnt, x = eduyrs,
color = pspwght, size = cntry)) # Exchanged color and size aes.

15 / 32
  • Now color is a continuous gradient rather than a set of distinct colors.

  • For size, a categorical variable makes little sense.

  • Categorical variables are factor and character vectors.

  • Continuous variables are numeric vectors.
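
Whether a variable counts as categorical or continuous can be checked, and changed, in base R. A minimal sketch with made-up vectors: wrapping a numeric variable in `factor()` is the same trick the later boxplot slides use for `eduyrs`.

```r
# Made-up examples of a continuous and a categorical variable.
weights <- c(0.85, 1.08, 1.27)
countries <- factor(c("Denmark", "Germany", "Sweden"))

is.numeric(weights)   # Continuous: a numeric vector.
is.factor(countries)  # Categorical: a factor.

# Turn a numeric variable into a categorical one.
years <- c(9, 12, 12, 16)
years_cat <- factor(years)
levels(years_cat)
```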

Aesthetics

If you want to set an aesthetic irrespective of the values of any variable, you need to place it outside aes(), as an argument of the geom function.

ggplot(data = ESS) +
geom_point(mapping = aes(y = imwbcnt, x = eduyrs,
size = pspwght, color = cntry),
shape = 21) # Use hollow circles.

16 / 32
  • alpha sets transparency, which varies between 0 (fully transparent) and 1 (opaque).

  • You need to give such an aesthetic a value that makes sense for it.
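
A common slip is to put a fixed value inside aes(). The sketch below, on a made-up data frame `toy`, contrasts setting color and alpha outside aes() with mapping the constant "blue" inside aes(); in the second case ggplot2 treats "blue" as a one-level variable and picks a default color instead.

```r
library(ggplot2)

toy <- data.frame(x = 1:3, y = c(2, 1, 3))

# Set: fixed values given as arguments of the geom.
p_set <- ggplot(data = toy) +
  geom_point(mapping = aes(x = x, y = y), color = "blue", alpha = 0.3)

# Map: "blue" is treated as data, not as a color name.
p_map <- ggplot(data = toy) +
  geom_point(mapping = aes(x = x, y = y, color = "blue"))

set_colours <- layer_data(p_set)$colour
set_alpha <- layer_data(p_set)$alpha
map_colours <- layer_data(p_map)$colour
```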

Geometric objects: As what do you visualize your data?

How are these two plots similar?

ggplot(data = ESS) +
geom_point(mapping = aes(y = imwbcnt, x = eduyrs))

ggplot(data = ESS) +
geom_smooth(mapping = aes(y = imwbcnt, x = eduyrs))

17 / 32
  • They show the same data, but expressed as different geometric objects.
  • ggplot2 contains more than 30 geoms. Extension packages contain even more.

Geometric objects: As what do you visualize your data?

ggplot(data = ESS) +
geom_boxplot(mapping = aes(y = imwbcnt,
x = factor(eduyrs)))





Source: Wikipedia

18 / 32

Geoms & weights: Apply or visualize, it depends on the geom ...

ggplot(data = ESS) +
geom_point(aes(y = imwbcnt, x = eduyrs,
size = pspwght)) # Visualize the weights as point size.

ggplot(data = ESS) +
geom_smooth(aes(y = imwbcnt, x = eduyrs,
weight = pspwght)) # Apply the weights when fitting the smooth.
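
What "applying" weights means can be seen outside of ggplot2. The sketch below fits a line with base R's `lm()` on made-up data, once without and once with weights; it only illustrates the idea of a weighted fit and is not what geom_smooth() does by default, since the default smoother is a local regression.

```r
# Made-up data: three points on a line plus one outlier with a tiny weight.
toy <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(1, 2, 3, 10),
  w = c(1, 1, 1, 0.01)
)

fit_unweighted <- lm(y ~ x, data = toy)
fit_weighted <- lm(y ~ x, data = toy, weights = w)

# The down-weighted outlier pulls the weighted slope much less.
slope_unweighted <- unname(coef(fit_unweighted)["x"])
slope_weighted <- unname(coef(fit_weighted)["x"])
c(unweighted = slope_unweighted, weighted = slope_weighted)
```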

19 / 32

Geoms & aesthetics: Some aesthetics are geom-specific

ggplot(data = ESS) +
geom_point(aes(y = imwbcnt, x=eduyrs, color=cntry,
size = pspwght,
shape = cntry))

ggplot(data = ESS) +
geom_smooth(aes(y=imwbcnt, x=eduyrs, color=cntry,
weight = pspwght,
linetype = cntry))

20 / 32
  • We can use the color aesthetic in both plots.

  • We cannot use shape for lines or line types for points.

  • Note that ggplot2 automatically groups data for geoms whenever you map an aesthetic to a categorical variable!
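
The automatic grouping can be made visible with `layer_data()`, which returns the data a layer is built from, including a `group` column. A minimal sketch on a made-up data frame `toy`:

```r
library(ggplot2)

toy <- data.frame(
  x = c(1, 2, 1, 2),
  y = c(1, 2, 2, 4),
  country = c("A", "A", "B", "B")
)

# No categorical aesthetic: all rows form one group (one line).
p_one <- ggplot(data = toy) +
  geom_line(mapping = aes(x = x, y = y))

# Color mapped to a categorical variable: one group (line) per country.
p_two <- ggplot(data = toy) +
  geom_line(mapping = aes(x = x, y = y, color = country))

n_groups_one <- length(unique(layer_data(p_one)$group))
n_groups_two <- length(unique(layer_data(p_two)$group))
```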

Multiple geoms: Are layered on top of each other

To have several geoms in one plot, simply add them with +; each one is drawn on top of the previous one.

ggplot(data = ESS) + # Coordinate system, add ...
geom_point(mapping = aes(y = imwbcnt, x = eduyrs, size = pspwght)) + # layer of points, add ...
geom_smooth(mapping = aes(y = imwbcnt, x = eduyrs, weight = pspwght)) # layer of smoothed average & 95%-CI.

21 / 32

Multiple geoms

The order of geoms matters: ggplot2 adds layer on top of layer.

ggplot(data = ESS) + # Coordinate system, add ...
geom_smooth(mapping = aes(y = imwbcnt, x = eduyrs, weight = pspwght)) +
geom_point(mapping = aes(y = imwbcnt, x = eduyrs, size = pspwght))

22 / 32

Global aesthetics

To avoid repetitive code, we can specify global aesthetics, which (by default) hold for all geoms.

ggplot(data = ESS, mapping = aes(y = imwbcnt, x = eduyrs)) + # Coord. system with global aesthetics, add ...
geom_point() + # a layer of points, add ...
geom_smooth() # a layer with a line of locally-smoothed averages and CI.
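
The "by default" can be switched off for a single geom with `inherit.aes = FALSE`. A minimal sketch on made-up data: the second layer uses its own small data frame `centre` (a hypothetical name) that lacks the variable `g`, so it must not inherit the global color mapping.

```r
library(ggplot2)

toy <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(2, 4, 1, 3),
  g = c("A", "A", "B", "B")
)
# One extra point at the overall means, stored in its own data frame.
centre <- data.frame(mx = mean(toy$x), my = mean(toy$y))

p <- ggplot(data = toy, mapping = aes(x = x, y = y, color = g)) +
  geom_point() +  # Uses the global aesthetics.
  geom_point(data = centre, mapping = aes(x = mx, y = my),
             inherit.aes = FALSE, size = 4)  # Ignores the global aesthetics.

rows_layer_1 <- nrow(layer_data(p, 1))
rows_layer_2 <- nrow(layer_data(p, 2))
```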

23 / 32

By the way, this is a nice example of why graphics are great for data exploration:

ggplot(data = ESS, mapping = aes(y = imwbcnt, x = factor(eduyrs), weight = pspwght)) + # Coord. system with global aesthetics, add ...
geom_boxplot() + # a layer of boxplots, add ...
# x is a factor, so ggplot2 would fit one smooth per education level; aes(group = 1) treats all observations as one group.
geom_smooth(mapping = aes(group = 1), se = FALSE) + # No CI (i.e., confidence interval), add ...
geom_smooth(mapping = aes(group = 1), method = "lm", se = FALSE, color = "red") # an OLS line.
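
Why `aes(group = 1)` is needed can be checked on a small example. In this sketch with a made-up data frame `toy`, `geom_line()` stands in for `geom_smooth()` to keep things simple: with a factor on the x-axis every level becomes its own group, and `group = 1` merges them again.

```r
library(ggplot2)

toy <- data.frame(eduyrs = c(9, 10, 11, 12), score = c(3, 4, 4, 6))

# A factor on the x-axis: ggplot2 forms one group per level.
p_split <- ggplot(data = toy, mapping = aes(x = factor(eduyrs), y = score)) +
  geom_line()

# group = 1 puts all observations into a single group.
p_joined <- ggplot(data = toy, mapping = aes(x = factor(eduyrs), y = score)) +
  geom_line(mapping = aes(group = 1))

n_groups_split <- length(unique(layer_data(p_split)$group))
n_groups_joined <- length(unique(layer_data(p_joined)$group))
```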

24 / 32

Local aesthetics: For the single geoms

Beware, local aesthetics override the global (default) aesthetics!

ggplot(data = ESS, mapping = aes(y = imwbcnt, x = eduyrs)) +
geom_point(mapping = aes(color = cntry, size = pspwght), alpha = 0.2) + # aes() for geom_point exclusively.
geom_smooth(mapping = aes(y = agea, weight = pspwght)) # Local y = agea overrides the global y = imwbcnt for this geom.

25 / 32

Putting it all together:

ggplot(data = ESS, # Coordinate system, add ...
mapping = aes(y = imwbcnt, x = factor(eduyrs), weight = pspwght)) + # define global aesthetics, add ...
geom_boxplot() + # a layer of boxplots, add
geom_smooth(mapping = aes(color = cntry, group = cntry)) # Add smooth for each country.

26 / 32

Facets: Sub-plots by categorical type

Use facets when adding another layer of (important) information would not improve the plot.

ggplot(data = ESS, mapping = aes(y = imwbcnt, x = factor(eduyrs), weight = pspwght)) +
geom_boxplot() +
geom_smooth(mapping = aes(group = 1)) +
facet_wrap( ~ cntry, nrow = 1) # Make sub-plots by cntry.

27 / 32

Consider whether faceting helps to see the comparisons you are interested in!

Facet grid: A cross-table of plots

ggplot(data = ESS, mapping = aes(y = imwbcnt, x = factor(eduyrs), weight = pspwght)) +
geom_boxplot() +
geom_smooth(mapping = aes(group = 1)) +
facet_grid(gndr ~ cntry) # Make sub-plots by gender (row) ~ country (column).

28 / 32

Save your plot

ggsave() saves your plot as pdf, jpeg, png, tiff, svg, bmp, ps, or eps. It guesses the type from the file extension of the name you give it (e.g., "MyPlot.pdf").

# Make our plot and assign it to object my_plot.
my_plot <- ggplot(data = ESS, mapping = aes(y = imwbcnt, x = factor(eduyrs), weight = pspwght)) +
geom_boxplot() +
geom_smooth(mapping = aes(group = 1)) +
facet_grid(gndr ~ cntry)
# Save the plot into the working directory as pdf. It shall be 8 inches wide and 3 inches high.
ggsave(filename = "myplot1.pdf", plot = my_plot, width = 8, height = 3)
# Save the plot but with different dimensions.
ggsave(filename = "myplot2.pdf", plot = my_plot, width = 16, height = 9)
# Save the plot as jpeg, again with different dimensions and very low resolution.
ggsave(filename = "myplot1.jpeg", plot = my_plot, width = 4.5, height = 9, dpi = 50)
29 / 32

PDFs do not need the dpi argument, because they are vector-based graphics.
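
To try ggsave() without cluttering the working directory, one can write to a temporary folder. A minimal sketch with a made-up plot; the file name `toy_plot.pdf` is an arbitrary choice, and width and height are in inches by default.

```r
library(ggplot2)

toy <- data.frame(x = 1:3, y = c(2, 1, 3))
toy_plot <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point()

# The ".pdf" extension tells ggsave() to write a PDF.
out_file <- file.path(tempdir(), "toy_plot.pdf")
ggsave(filename = out_file, plot = toy_plot, width = 8, height = 3)

# Check that the file was written.
file.exists(out_file)
```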

30 / 32

Today's general lesson

# A general template
ggplot(data = <DATA>) + # Create a coordinate system for <DATA>, and add "+"
<GEOM_FUNCTION>( # a layer of (geometric) information, which
mapping = aes(<MAPPINGS>), # maps our data to aesthetics, and
stat = <STAT>, # may depend on statistical transformations.
position = <POSITION> # Positioning may be adjusted.
) +
<COORDINATE_FUNCTION> + # Change the default coordinate system.
<FACET_FUNCTION> # Draw sub-plots by categorical variables.

Source: Wickham & Grolemund "R for Data Science"

31 / 32

Today's (important) functions

  1. transmute(): similar to mutate(), but only keeps the newly generated variables.
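
The difference between the two functions is easiest to see side by side. A minimal sketch on a made-up data frame `toy`:

```r
library(dplyr)

toy <- data.frame(x = c(1, 2), y = c(10, 20))

# mutate() keeps the existing variables and adds the new one.
names_mutate <- names(mutate(toy, z = x + y))

# transmute() keeps only the newly generated variable.
names_transmute <- names(transmute(toy, z = x + y))

names_mutate
names_transmute
```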
32 / 32
