Your instructor - Rick Scavetta
Drawing your first plot
To get a first feel for ggplot2, let’s try to run some basic ggplot2 commands. The mtcars dataset contains information on 32 cars from the 1974 Motor Trend magazine. This dataset is small, intuitive, and contains a variety of continuous and categorical variables.
# Load the ggplot2 package
library(pacman)
p_load(ggplot2)
# Explore the mtcars data frame with str()
data(mtcars)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# Execute the following command
p <- ggplot(mtcars, aes(cyl, mpg)) +
  geom_point()
# Print the plot stored in p
p
Notice that ggplot2 treats cyl as a continuous variable. You get a plot, but it’s not quite right, because it gives the impression that there is such a thing as a 5 or 7-cylinder car, which there is not.
Data column types affect plot types
The plot from the previous exercise wasn’t really satisfying. Although cyl (the number of cylinders) is categorical, you probably noticed that it is classified as numeric in mtcars. This is really misleading because the representation in the plot doesn’t match the actual data type. You’ll have to explicitly tell ggplot2 that cyl is a categorical variable.
# Change the command below so that cyl is treated as factor
ggplot(mtcars, aes(factor(cyl), mpg)) +
geom_point()
Notice that ggplot2 now treats cyl as a factor. This time the x-axis does not contain values like 5 or 7, only the values that are present in the dataset.
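To see why the two plots differ, it can help to check how R stores cyl before and after the conversion. The following is a minimal sketch using only base R; the name cyl_factor is ours and is not part of the course code.

```r
# mtcars ships with base R (datasets package)
data(mtcars)

# cyl is stored as a plain number, so ggplot2 gives it a continuous axis
class(mtcars$cyl)

# factor() turns it into a categorical variable with one level per distinct value
cyl_factor <- factor(mtcars$cyl)   # illustrative name
levels(cyl_factor)

# Number of cars in each cylinder class
table(cyl_factor)
```

Only the levels that occur in the data (4, 6 and 8) exist in the factor, which is why the discrete axis has no room for a 5 or a 7.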
The grammar of graphics
Mapping data columns to aesthetics
Let’s dive a little deeper into the three main topics in this course: The data, aesthetics, and geom layers. We’ll get to making pretty plots in the last chapter with the themes layer.
We’ll continue working on the 32 cars in the mtcars data frame.
Consider how the examples and concepts we discuss throughout these courses apply to your own datasets!
# Edit to add a color aesthetic mapped to disp
ggplot(mtcars, aes(wt, mpg, color = disp)) +
geom_point()
# Change the color aesthetic to a size aesthetic
ggplot(mtcars, aes(wt, mpg, size = disp)) +
geom_point()
Notice that a legend for the color and size scales was automatically generated.
ggplot2 layers
Adding geometries
The diamonds dataset contains details of 53,940 diamonds. Among the variables included are carat (a measurement of the diamond’s size) and price.
You’ll use two common geom layer functions: geom_point(), which adds points, and geom_smooth(), which adds a smoothed trend curve.
As you saw previously, these are added using the + operator.
ggplot(data, aes(x, y)) +
geom_*()
Where * is the specific geometry needed.
# Explore the diamonds data frame with str()
data(diamonds)
str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
# Add geom_point() with +
ggplot(diamonds, aes(carat, price)) +
geom_point()
# Add geom_smooth() with +
ggplot(diamonds, aes(carat, price)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
If you had executed the command without adding a geom layer with +, older versions of ggplot2 would produce the error message ‘No layers in plot’, because the third essential layer, the geom layer, is missing. Recent versions may instead draw an empty panel.
Changing one geom or every geom
If you have multiple geoms, then mapping an aesthetic to a data variable inside the call to ggplot() will change all the geoms. It is also possible to make changes to individual geoms by passing arguments to the geom_*() functions.
geom_point() has an alpha argument that controls the opacity of the points. A value of 1 (the default) means that the points are totally opaque; a value of 0 means the points are totally transparent (and therefore invisible). Values in between specify partial transparency.
The plot you drew last time is provided in the script.
# Map the color aesthetic to clarity
ggplot(diamonds, aes(carat, price, color = clarity)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
# Make the points 40% opaque
ggplot(diamonds, aes(carat, price, color = clarity)) +
geom_point(alpha = 0.4) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
geom_point() + geom_smooth() is a common combination.
Saving plots as variables
Plots can be saved as variables, which can be added to later on using the + operator. This is really useful if you want to make multiple related plots from a common base.
# Draw a ggplot
plt_price_vs_carat <- ggplot(
# Use the diamonds dataset
diamonds,
# For the aesthetics, map x to carat and y to price
aes(carat, price)
)
# Add a point layer to plt_price_vs_carat
plt_price_vs_carat + geom_point()
# Edit this to make points 20% opaque: plt_price_vs_carat_transparent
plt_price_vs_carat_transparent <- plt_price_vs_carat + geom_point(alpha = 0.2)
# See the plot
plt_price_vs_carat_transparent
# Edit this to map color to clarity,
# Assign the updated plot to a new object
plt_price_vs_carat_by_clarity <- plt_price_vs_carat + geom_point(aes(color = clarity))
# See the plot
plt_price_vs_carat_by_clarity
Assigning parts of plots to a variable and then reusing that variable in other plots makes it really clear how much those plots have in common.
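One detail worth checking is that adding a layer with + returns a new plot object and leaves the base object untouched, which is what makes a common base safe to reuse. The sketch below counts the layers in each object; the names plt_base, plt_points and plt_smooth are illustrative, and accessing layers via $layers is assumed to work in your ggplot2 version.

```r
library(ggplot2)

# Base plot: data and aesthetics only, no geom layer yet
plt_base <- ggplot(diamonds, aes(carat, price))

# Adding a layer returns a new plot object; plt_base itself is unchanged
plt_points <- plt_base + geom_point(alpha = 0.2)
plt_smooth <- plt_points + geom_smooth()

# Count the layers stored in each object
length(plt_base$layers)
length(plt_points$layers)
length(plt_smooth$layers)
```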
Visible aesthetics
All about aesthetics: color, shape and size
In the video you saw 9 visible aesthetics. Let’s apply them to a categorical variable: the cylinders in mtcars, cyl.
These are the aesthetics you can consider within aes() in this chapter: x, y, color, fill, size, alpha, label and shape.
One common convention is that you don’t name the x and y arguments to aes(), since they almost always come first, but you do name other arguments.
In the following exercise the fcyl column is categorical. It is cyl transformed into a factor.
mtcars$fcyl <- as.factor(mtcars$cyl)
# Map x to mpg and y to fcyl
ggplot(mtcars, aes(mpg, fcyl)) +
geom_point()
# Swap mpg and fcyl
ggplot(mtcars, aes(fcyl, mpg)) +
geom_point()
# Map x to wt, y to mpg and color to fcyl
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
geom_point()
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Set the shape and size of the points
geom_point(shape = 1, size = 4)
Head over to the next exercise for some more aesthetics!
All about aesthetics: color vs. fill
Typically, the color aesthetic changes the outline of a geom and the fill aesthetic changes the inside. geom_point() is an exception: you use color (not fill) for the point color. However, some shapes have special behavior.
The default geom_point() uses shape = 19: a solid circle. An alternative is shape = 21: a circle that allows you to use both fill for the inside and color for the outline. This lets you map two aesthetics to each point.
All shape values are described on the points() help page.
fcyl and fam are the cyl and am columns converted to factors, respectively.
mtcars$fam <- as.factor(mtcars$am)
# Map fcyl to fill
ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
geom_point(shape = 1, size = 4)
ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
# Change point shape; set alpha
geom_point(shape = 21, size = 4, alpha = 0.6)
# Map color to fam
ggplot(mtcars, aes(wt, mpg, fill = fcyl, color = fam)) +
geom_point(shape = 21, size = 4, alpha = 0.6)
Notice that mapping a categorical variable onto fill doesn’t change the colors, although a legend is generated! This is because the default shape for points only has a color attribute and not a fill attribute! Use fill when you have another shape (such as a bar), or when using a point that does have a fill and a color attribute, such as shape = 21, which is a circle with an outline. Any time you use a solid color, make sure to use alpha blending to account for overplotting.
All about aesthetics: comparing aesthetics
Now that you’ve got some practice with using attributes, be careful of a major pitfall: these attributes can overwrite the aesthetics of your plot!
# Establish the base layer
plt_mpg_vs_wt <- ggplot(mtcars, aes(wt, mpg))
# Map fcyl to size
plt_mpg_vs_wt +
geom_point(aes(size = fcyl))
## Warning: Using size for a discrete variable is not advised.
# Map fcyl to alpha, not size
plt_mpg_vs_wt +
geom_point(aes(alpha = fcyl))
## Warning: Using alpha for a discrete variable is not advised.
# Map fcyl to shape, not alpha
plt_mpg_vs_wt +
geom_point(aes(shape = fcyl))
# Use text layer and map fcyl to label
plt_mpg_vs_wt +
geom_text(aes(label = fcyl))
Which aesthetic do you think is the clearest for categorical data?
Using attributes
All about attributes: color, shape, size and alpha
This time you’ll use these arguments to set attributes of the plot, not map variables onto aesthetics.
You can specify colors in R using hex codes: a hash followed by two hexadecimal digits each for red, green, and blue ("#RRGGBB"). Hexadecimal is base-16 counting. You have 0 to 9, and A representing 10 up to F representing 15. Pairs of hexadecimal digits give you a range from 0 to 255. "#000000" is black (no color), "#FFFFFF" is white, and "#00FFFF" is cyan (mixed green and blue).
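The hex arithmetic can be checked directly in base R. This short sketch uses col2rgb(), strtoi() and rgb() from base R and grDevices; the decimal triple passed to rgb() was chosen by us so that it reproduces the my_blue value used in this exercise.

```r
# Convert hex codes to their red, green and blue components (0-255)
col2rgb(c("#000000", "#FFFFFF", "#00FFFF"))

# One pair of hexadecimal digits converted by hand: FF is 15 * 16 + 15
strtoi("FF", base = 16L)

# Go the other way: build a hex code from decimal components
rgb(74, 190, 255, maxColorValue = 255)
```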
A hexadecimal color variable, my_blue, has been defined for you.
# A hexadecimal color
my_blue <- "#4ABEFF"
ggplot(mtcars, aes(wt, mpg)) +
# Set the point color and alpha
geom_point(color = my_blue, alpha = 0.6)
# Change the color mapping to a fill mapping
ggplot(mtcars, aes(wt, mpg, fill = fcyl)) +
# Set point size and shape
geom_point(color = my_blue, size = 10, shape = 21)
ggplot2 lets you control these attributes in many ways to customize your plots.
All about attributes: conflicts with aesthetics
In the videos you saw that you can use all the aesthetics as attributes. Let’s see how this works with the aesthetics you used in the previous exercises: x, y, color, fill, size, alpha, label and shape.
In this exercise you will set all kinds of attributes of the points!
You will continue to work with mtcars.
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Add point layer with alpha 0.5
geom_point(alpha = 0.5)
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Add text layer with label rownames(mtcars) and color red
geom_text(label = rownames(mtcars), color = "red")
ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
# Add points layer with shape 24 and color yellow
geom_point(shape = 24, color = "yellow")
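To confirm that an attribute set inside a geom wins over an aesthetic mapped in aes(), you can inspect the data ggplot2 actually draws. This is a sketch, assuming layer_data() from ggplot2 is available in your version; plt_conflict and drawn are illustrative names.

```r
library(ggplot2)

mtcars$fcyl <- as.factor(mtcars$cyl)

# color is mapped to fcyl in aes(), but also set to "red" as an attribute
plt_conflict <- ggplot(mtcars, aes(wt, mpg, color = fcyl)) +
  geom_point(color = "red")

# layer_data() returns the data ggplot2 computes for a given layer
drawn <- layer_data(plt_conflict, 1)

# Every point ends up with the attribute value, not a per-group color
unique(drawn$colour)
```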
Going all out
In this exercise, you will gradually add more aesthetic mappings to the plot. You’re still working with the mtcars dataset, but this time you’re using more features of the cars. Each of the columns is described on the mtcars help page. Columns fcyl and fam have been added (as before) as categorical versions of cyl and am respectively.
Notice that adding more aesthetic mappings to your plot is not always a good idea! You may just increase complexity and decrease readability.
# 3 aesthetics: qsec vs. mpg, colored by fcyl
ggplot(mtcars, aes(mpg, qsec, color = fcyl)) +
geom_point()
# 4 aesthetics: add a mapping of shape to fam
ggplot(mtcars, aes(mpg, qsec, color = fcyl, shape = fam)) +
geom_point()
# 5 aesthetics: add a mapping of size to hp / wt
ggplot(mtcars, aes(mpg, qsec, color = fcyl, shape = fam, size = hp / wt)) +
geom_point()
Between the x and y dimensions, the color, shape, and size of the points, your plot displays five dimensions of the dataset!
Modifying aesthetics
position
Adjustment for overlapping
Updating aesthetic labels
In this exercise, you’ll modify some aesthetics to make a bar plot of the number of cylinders for cars with different types of transmission.
You’ll also make use of some functions for improving the appearance of the plot.
values (passed to scale_fill_manual()) is a named vector of colors to use.
levels(mtcars$fam) <- c("automatic", "manual")
ggplot(mtcars, aes(fcyl, fill = fam)) +
geom_bar() +
# Set the axis labels
labs(x = "Number of Cylinders",
y = "Count")
palette <- c(automatic = "#377EB8", manual = "#E41A1C")
ggplot(mtcars, aes(fcyl, fill = fam)) +
geom_bar() +
labs(x = "Number of Cylinders", y = "Count") +
# Set the fill color scale
scale_fill_manual("Transmission", values = palette)
# Set the position
ggplot(mtcars, aes(fcyl, fill = fam)) +
geom_bar(position = "dodge") +
labs(x = "Number of Cylinders", y = "Count") +
scale_fill_manual("Transmission", values = palette)
Setting a dummy aesthetic
In the last chapter you saw that all the visible aesthetics can serve as attributes and aesthetics, but I very conveniently left out x and y. That’s because although you can make univariate plots (such as histograms, which you’ll get to in the next chapter), a y-axis will always be provided, even if you didn’t ask for it.
You can make univariate plots in ggplot2, but you will need to add a fake y axis by mapping y to zero.
When setting y-axis limits, you can specify the limits as separate arguments, or as a single numeric vector. That is, ylim(lo, hi) or ylim(c(lo, hi)).
# Plot 0 vs. mpg
ggplot(mtcars, aes(mpg, 0)) +
# Add jitter
geom_point(position = "jitter")
ggplot(mtcars, aes(mpg, 0)) +
geom_jitter() +
# Set the y-axis limits
ylim(c(-2, 2))
The best way to make your plot depends on a lot of different factors and sometimes ggplot2 might not be the best choice.
Aesthetics best practices
Which aesthetics?
Form follows function
Function
Primary:
Secondary:
Guiding principles
Never:
Always:
The best choices for aesthetics
Continuous data
Categorical data
Scatter plots
Overplotting 1: large datasets
Scatter plots (using geom_point()) are intuitive, easily understood, and very common, but we must always consider overplotting, particularly in the following four situations: large datasets, aligned values on a single axis, low-precision data, and integer data.
Typically, alpha blending (i.e. adding transparency) is recommended when using solid shapes. Alternatively, you can use opaque, hollow shapes.
Small points are suitable for large datasets with regions of high density (lots of overlapping).
Let’s use the diamonds dataset to practice dealing with the large dataset case.
# Plot price vs. carat, colored by clarity
plt_price_vs_carat_by_clarity <- ggplot(diamonds, aes(carat, price, color = clarity))
# Add a point layer with tiny points
plt_price_vs_carat_by_clarity + geom_point(alpha = 0.5, shape = ".")
# Set shape to 16
plt_price_vs_carat_by_clarity + geom_point(alpha = 0.5, shape = 16)
Overplotting 2: Aligned values
Let’s take a look at another case where we should be aware of overplotting: Aligning values on a single axis.
This occurs when one axis is continuous and the other is categorical, which can be overcome with some form of jittering.
In the mtcars data set, fcyl and fam are categorical variants of cyl and am, respectively.
# Plot base
plt_mpg_vs_fcyl_by_fam <- ggplot(mtcars, aes(fcyl, mpg, color = fam))
# Default points are shown for comparison
plt_mpg_vs_fcyl_by_fam + geom_point()
# Alter the point positions by jittering, width 0.3
plt_mpg_vs_fcyl_by_fam + geom_point(position = position_jitter(0.3))
# Now jitter and dodge the point positions
plt_mpg_vs_fcyl_by_fam + geom_point(position = position_jitterdodge(jitter.width = 0.3, dodge.width = 0.3))
These are some simple ways of dealing with overplotting, but you’ll encounter more ideas throughout the ggplot2 courses when we encounter atypical geoms.
Overplotting 3: Low-precision data
You already saw how to deal with overplotting when using geom_point() in two cases: large datasets and aligned values on a single axis.
For aligned values we used jittering, either through the position argument of geom_point() or through geom_jitter().
Let’s take a look at another case: low-precision data.
This results from low-resolution measurements like in the iris dataset, which is measured to 1 mm precision. It’s similar to case 2, but in this case we can jitter on both the x and y axis.
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
# Swap for jitter layer with width 0.1
geom_jitter(alpha = 0.5, width = 0.1)
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
# Set the position to jitter
geom_point(alpha = 0.5, position = "jitter")
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
# Use a jitter position function with width 0.1
geom_point(alpha = 0.5, position = position_jitter(0.1))
Notice that jitter can be a geom itself (i.e. geom_jitter()), an argument in geom_point() (i.e. position = "jitter"), or a position function (i.e. position_jitter()).
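Because jitter is random, the points move every time the plot is redrawn. One practical reason to prefer the position function is that it accepts a seed. The sketch below puts the three forms side by side; the names plt_iris, p1, p2 and p3 and the seed value 42 are our own choices.

```r
library(ggplot2)

plt_iris <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species))

# 1. Jitter as a geom
p1 <- plt_iris + geom_jitter(alpha = 0.5, width = 0.1)

# 2. Jitter as a position argument of geom_point() (default jitter amount)
p2 <- plt_iris + geom_point(alpha = 0.5, position = "jitter")

# 3. Jitter as a position function; seed makes the random offsets repeatable
p3 <- plt_iris +
  geom_point(alpha = 0.5, position = position_jitter(width = 0.1, seed = 42))

# With a fixed seed, building the plot twice gives the same x positions
identical(layer_data(p3)$x, layer_data(p3)$x)
```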
Overplotting 4: Integer data
Let’s take a look at the last case of dealing with overplotting: integer data.
This can be type integer (i.e. 1, 2, 3…) or categorical (i.e. class factor) variables. factor is just a special class of type integer.
You’ll typically have a small, defined number of intersections between two variables, which is similar to case 3, but you may miss it if you don’t realize that integer and factor data are the same as low precision data.
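The claim that a factor is a special class of type integer is easy to verify in base R. A minimal sketch, with fcyl built locally from mtcars as in the earlier exercises:

```r
data(mtcars)
fcyl <- factor(mtcars$cyl)

class(fcyl)    # "factor"
typeof(fcyl)   # "integer"

# Underneath, each value is an integer code pointing into the levels
head(as.integer(fcyl))
levels(fcyl)

# Two such variables can only meet at a small number of distinct positions
nrow(unique(mtcars[, c("cyl", "gear")]))
```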
The Vocab dataset provided contains the years of education and vocabulary test scores from respondents to US General Social Surveys from 1972-2004.
p_load(car)
data(Vocab)
# Examine the structure of Vocab
str(Vocab)
## 'data.frame': 30351 obs. of 4 variables:
## $ year : num 1974 1974 1974 1974 1974 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 2 2 2 1 1 ...
## $ education : num 14 16 10 10 12 16 17 10 12 11 ...
## $ vocabulary: num 9 9 9 5 8 8 9 5 3 5 ...
## - attr(*, "na.action")= 'omit' Named int [1:32115] 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "names")= chr [1:32115] "19720001" "19720002" "19720003" "19720004" ...
# Plot vocabulary vs. education
ggplot(Vocab, aes(education, vocabulary)) +
# Add a point layer
geom_point()
ggplot(Vocab, aes(education, vocabulary)) +
# Change to a jitter layer
geom_jitter()
ggplot(Vocab, aes(education, vocabulary)) +
# Set the transparency to 0.2
geom_jitter(alpha = 0.2)
ggplot(Vocab, aes(education, vocabulary)) +
# Set the shape to 1
geom_jitter(alpha = 0.2, shape = 1)
Notice how jittering and alpha blending serve as a great solution to the overplotting problem here. Setting the shape to 1 didn’t really help, but it was useful in the previous exercises when you had less data. You need to consider each plot individually. You’ll encounter this dataset again when you look at bar plots.
Histograms
Drawing histograms
Recall that histograms cut up a continuous variable into discrete bins and, by default, map the internally calculated count variable (the number of observations in each bin) onto the y aesthetic. An internal variable called density can be accessed by using the .. notation, i.e. ..density.. (newer ggplot2 releases also accept after_stat(density)). Plotting this variable shows relative frequency: for each bin, it is the height times the width of the bar.
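The relationship between density and relative frequency can be checked numerically. This sketch uses base R's hist() rather than ggplot2's own binning, so the exact bin edges are an assumption and may differ from what geom_histogram() chooses; the names breaks, h and rel_freq are illustrative.

```r
data(mtcars)

# Bin mpg with a bin width of 1 (breaks chosen to cover the data range)
breaks <- seq(floor(min(mtcars$mpg)), ceiling(max(mtcars$mpg)), by = 1)
h <- hist(mtcars$mpg, breaks = breaks, plot = FALSE)

# Relative frequency of each bin = density (bar height) * bin width
rel_freq <- h$density * diff(h$breaks)

# The relative frequencies sum to 1 and match counts / total
sum(rel_freq)
all.equal(rel_freq, h$counts / sum(h$counts))
```

With a bin width of 1 the density and the relative frequency are numerically equal, which is one reason an intuitive bin width makes a density axis easier to read.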
# Plot mpg
ggplot(mtcars, aes(mpg)) +
# Add a histogram layer
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mtcars, aes(mpg)) +
# Set the binwidth to 1
geom_histogram(binwidth = 1)
# Map y to ..density..
ggplot(mtcars, aes(mpg, ..density..)) +
geom_histogram(binwidth = 1)
datacamp_light_blue <- "#51A8C9"
ggplot(mtcars, aes(mpg, ..density..)) +
# Set the fill color to datacamp_light_blue
geom_histogram(binwidth = 1, fill = datacamp_light_blue)
Histograms are one of the most common exploratory plots for continuous data. If you want to use density on the y-axis be sure to set your binwidth to an intuitive value.
Positions in histograms
Here, we’ll examine the various ways of applying positions to histograms. geom_histogram(), a special case of geom_bar(), has a position argument that can take on the following values:
stack (the default): Bars for different groups are stacked on top of each other.
dodge: Bars for different groups are placed side by side.
fill: Bars for different groups are shown as proportions.
identity: Plot the values as they appear in the dataset.
# Update the aesthetics so the fill color is by fam
ggplot(mtcars, aes(mpg, fill = fam)) +
geom_histogram(binwidth = 1)
ggplot(mtcars, aes(mpg, fill = fam)) +
# Change the position to dodge
geom_histogram(binwidth = 1, position = "dodge")
ggplot(mtcars, aes(mpg, fill = fam)) +
# Change the position to fill
geom_histogram(binwidth = 1, position = "fill")
## Warning: Removed 16 rows containing missing values (geom_bar).
ggplot(mtcars, aes(mpg, fill = fam)) +
# Change the position to identity, with transparency 0.4
geom_histogram(binwidth = 1, position = "identity", alpha = 0.4)
Bar plots
Position in bar and col plots
Let’s see how the position argument changes geom_bar().
We have three position options:
stack: The default
dodge: Preferred
fill: To show proportions
While we will be using geom_bar() here, note that the function geom_col() is just geom_bar() with the stat argument set to "identity" (its position still defaults to "stack"). It is used when we want the heights of the bars to represent the exact values in the data.
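The difference between the two functions is easiest to see with pre-counted data. In this sketch the counts data frame is our own construction: geom_col() plots its Freq column as-is, while geom_bar() does the counting from the raw rows.

```r
library(ggplot2)

# Pre-compute the number of cars per cylinder class
counts <- as.data.frame(table(cyl = mtcars$cyl))
counts

# geom_col(): bar heights are the values already in the data
ggplot(counts, aes(cyl, Freq)) +
  geom_col()

# geom_bar(): counts the rows per category for you
ggplot(mtcars, aes(factor(cyl))) +
  geom_bar()
```

Both plots show the same bar heights; which function to use depends on whether your data is already summarized.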
In this exercise, you’ll draw the total count of cars having a given number of cylinders (fcyl), according to manual or automatic transmission type (fam).
# Plot fcyl, filled by fam
ggplot(mtcars, aes(fcyl, fill = fam)) +
# Add a bar layer
geom_bar()
ggplot(mtcars, aes(fcyl, fill = fam)) +
# Set the position to "fill"
geom_bar(position = "fill")
ggplot(mtcars, aes(fcyl, fill = fam)) +
# Change the position to "dodge"
geom_bar(position = "dodge")
Different kinds of plots need different position arguments, so it’s important to be familiar with this attribute.
Overlapping bar plots
You can customize bar plots further by adjusting the dodging so that your bars partially overlap each other. Instead of using position = "dodge", you’re going to use position_dodge(), like you did with position_jitter() in the previous exercises. Here, you’ll save this as an object, posn_d, so that you can easily reuse it.
Remember, the reason you want to use position_dodge() (and position_jitter()) is to specify how much dodging (or jittering) you want.
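Saving the position as an object might look like the following sketch. The name posn_d comes from the text above; fam is rebuilt here from am so the snippet runs on its own.

```r
library(ggplot2)

mtcars$fam <- as.factor(mtcars$am)

# Save the position once so the same dodge width can be reused
posn_d <- position_dodge(width = 0.2)

# Reuse it in any layer that needs the same dodging
ggplot(mtcars, aes(cyl, fill = fam)) +
  geom_bar(position = posn_d, alpha = 0.6)
```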
For this example, you’ll use the mtcars dataset.
ggplot(mtcars, aes(cyl, fill = fam)) +
# Change position to use the functional form, with width 0.2
geom_bar(position = position_dodge(width = 0.2))
ggplot(mtcars, aes(cyl, fill = fam)) +
# Set the transparency to 0.6
geom_bar(position = position_dodge(width = 0.2), alpha = 0.6)
By using these position functions, you can customize your plot to suit your needs.
Bar plots: sequential color palette
In this bar plot, we’ll fill each segment according to an ordinal variable. The best way to do that is with a sequential color palette.
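Here’s an example of setting a ColorBrewer palette with the mtcars dataset (note that "Set1" itself is a qualitative palette; "Blues", used later in this exercise, is sequential):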
ggplot(mtcars, aes(fcyl, fill = fam)) +
geom_bar() +
scale_fill_brewer(palette = "Set1")
In the exercise, you’ll use similar code on the Vocab dataset. Both datasets are ordinal.
Vocab$education <- as.factor(Vocab$education)
Vocab$vocabulary <- as.factor(Vocab$vocabulary)
# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
# Add a bar layer with position "fill"
geom_bar(position = "fill")
# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
# Add a bar layer with position "fill"
geom_bar(position = "fill") +
# Add a brewer fill scale with default palette
scale_fill_brewer()
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Blues is 9
## Returning the palette you asked for with that many colors
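Good job! However, the plot is not complete: as the warning says, the Blues palette has at most 9 colors, which is fewer than the number of vocabulary levels. Let’s fix this in the next exercise.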
p_load(RColorBrewer)
# Definition of a set of blue colors
blues <- brewer.pal(9, "Blues") # from the RColorBrewer package
# 1 - Make a color range using colorRampPalette() and the set of blues
blue_range <- colorRampPalette(blues)
# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
# Add a bar layer with position "fill"
geom_bar(position = "fill") +
# 2 - Use a manual fill scale with 11 colors drawn from blue_range
scale_fill_manual(values = blue_range(11))
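It may not be obvious that colorRampPalette() returns a function rather than a vector of colors. The sketch below uses only grDevices; the two end-point colors are illustrative stand-ins for the nine RColorBrewer blues interpolated above.

```r
# Two illustrative end points (light blue to dark blue)
blue_range <- colorRampPalette(c("#F7FBFF", "#08306B"))

# The result is a function: call it with the number of colors you need
is.function(blue_range)
blue_range(3)

# Ask for as many colors as the fill variable has levels
length(blue_range(11))
```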
Line plots
Basic line plots
Here, we’ll use the economics dataset to make some line plots. The dataset contains a time series for unemployment and population statistics from the Federal Reserve Bank of St. Louis in the United States. The data is contained in the ggplot2 package.
To begin with, you can look at how the median unemployment time and the unemployment rate (the number of unemployed people as a proportion of the population) change over time.
# Print the head of economics
head(economics)
## # A tibble: 6 x 6
## date pce pop psavert uempmed unemploy
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1967-07-01 507. 198712 12.6 4.5 2944
## 2 1967-08-01 510. 198911 12.6 4.7 2945
## 3 1967-09-01 516. 199113 11.9 4.6 2958
## 4 1967-10-01 512. 199311 12.9 4.9 3143
## 5 1967-11-01 517. 199498 12.8 4.7 3066
## 6 1967-12-01 525. 199657 11.8 4.8 3018
# Using economics, plot unemploy vs. date
ggplot(economics, aes(date, unemploy)) +
# Make it a line plot
geom_line()
# Change the y-axis to the proportion of the population that is unemployed
ggplot(economics, aes(date, unemploy / pop)) +
geom_line()
In the next exercise, we’ll make more complicated line plots.
Multiple time series
We already saw how the form of your data affects how you can plot it. Let’s explore that further with multiple time series. Here, it’s important that all lines are on the same scale, and if possible, on the same plot.
fish.species contains the global capture rates of seven salmon species from 1950–2010. Each variable (column) is a salmon species and each observation (row) is one year. fish.tidy contains the same data, but in three columns: Species, Year, and Capture (i.e. one variable per column).
load("_data/fish.RData")
str(fish.species)
## 'data.frame': 61 obs. of 8 variables:
## $ Year : int 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 ...
## $ Pink : int 100600 259000 132600 235900 123400 244400 203400 270119 200798 200085 ...
## $ Chum : int 139300 155900 113800 99800 148700 143700 158480 125377 132407 113114 ...
## $ Sockeye : int 64100 51200 58200 66100 83800 72000 84800 69676 100520 62472 ...
## $ Coho : int 30500 40900 33600 32400 38300 45100 40000 39900 39200 32865 ...
## $ Rainbow : int 0 100 100 100 100 100 100 100 100 100 ...
## $ Chinook : int 23200 25500 24900 25300 24500 27700 25300 21200 20900 20335 ...
## $ Atlantic: int 10800 9701 9800 8800 9600 7800 8100 9000 8801 8700 ...
# Use gather to go from fish.species to fish.tidy
#fish.tidy <- gather(fish.species, Species, Capture, -Year)
str(fish.tidy)
## 'data.frame': 427 obs. of 3 variables:
## $ Species: Factor w/ 7 levels "Pink","Chum",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : int 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 ...
## $ Capture: int 100600 259000 132600 235900 123400 244400 203400 270119 200798 200085 ...
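The wide-to-long reshaping behind fish.tidy can be sketched on a toy data frame. This example assumes the tidyr package is installed and uses pivot_longer(), the newer counterpart of the gather() call commented out above; fish_wide and its numbers are made up for illustration and are not the real fish data.

```r
library(tidyr)

# Toy wide data: one column per species (made-up numbers)
fish_wide <- data.frame(
  Year = c(1950, 1951, 1952),
  Pink = c(10, 20, 30),
  Chum = c(5, 15, 25)
)

# Wide to long: one row per Year/Species combination
fish_long <- pivot_longer(
  fish_wide,
  cols = -Year,
  names_to = "Species",
  values_to = "Capture"
)
fish_long
```

The long form has a single Species column, which is what lets you map it to the group or color aesthetic.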
# Plot the Rainbow Salmon time series
ggplot(fish.species, aes(x = Year, y = Rainbow)) +
geom_line()
# Plot the Pink Salmon time series
ggplot(fish.species, aes(x = Year, y = Pink)) +
geom_line()
# Plot multiple time-series by grouping by species
ggplot(fish.tidy, aes(Year, Capture)) +
geom_line(aes(group = Species))
# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(Year, Capture, color = Species)) +
geom_line(aes(group = Species))
As you can see in the last couple of plots, a grouping aesthetic was vital here. If you don’t specify group = Species or color = Species, you’ll get a mess of lines.
Themes from scratch
element_blank() is used to remove an item
Moving the legend
Let’s wrap up this course by making a publication-ready plot communicating a clear message.
To change stylistic elements of a plot, call theme() and set plot properties to a new value. For example, the following changes the legend position.
p + theme(legend.position = new_value)
Here, the new value can be
"top", "bottom", "left", or "right": place it at that side of the plot.
"none": don’t draw it.
c(x, y): c(0, 0) means the bottom-left and c(1, 1) means the top-right.
# View the default plot
plt_prop_unemployed_over_time
# Remove legend entirely
plt_prop_unemployed_over_time +
theme(legend.position = "none")
# Position the legend at the bottom of the plot
plt_prop_unemployed_over_time +
theme(legend.position = "bottom")
# Position the legend inside the plot at (0.6, 0.1)
plt_prop_unemployed_over_time +
theme(legend.position = c(0.6, 0.1))
Nice! But be careful when placing a legend inside your plotting space. You could end up obscuring data.
Modifying theme elements
Many plot elements have multiple properties that can be set. For example, line elements in the plot such as axes and gridlines have a color, a thickness (size; called linewidth from ggplot2 3.4.0 onward), and a line type (solid line, dashed, or dotted). To set the style of a line, you use element_line(). For example, to make the axis lines into red, dashed lines, you would use the following.
p + theme(axis.line = element_line(color = "red", linetype = "dashed"))
Similarly, element_rect() changes rectangles and element_text() changes text. You can remove a plot element using element_blank().
plt_prop_unemployed_over_time is available.
plt_prop_unemployed_over_time +
theme(
# For all rectangles, set the fill color to grey92
rect = element_rect(fill = "grey92"),
# For the legend key, turn off the outline
legend.key = element_rect(color = NA)
)
plt_prop_unemployed_over_time +
theme(
rect = element_rect(fill = "grey92"),
legend.key = element_rect(color = NA),
# Turn off axis ticks
axis.ticks = element_blank(),
# Turn off the panel grid
panel.grid = element_blank()
)
plt_prop_unemployed_over_time +
theme(
rect = element_rect(fill = "grey92"),
legend.key = element_rect(color = NA),
axis.ticks = element_blank(),
panel.grid = element_blank(),
# Add major y-axis panel grid lines back
panel.grid.major.y = element_line(
# Set the color to white
color = "white",
# Set the size to 0.5
size = 0.5,
# Set the line type to dotted
linetype = "dotted"
)
)
plt_prop_unemployed_over_time +
theme(
rect = element_rect(fill = "grey92"),
legend.key = element_rect(color = NA),
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.grid.major.y = element_line(
color = "white",
size = 0.5,
linetype = "dotted"
),
# Set the axis text color to grey25
axis.text = element_text(color = "grey25"),
# Set the plot title font face to italic and font size to 16
plot.title = element_text(face = "italic", size = 16)
)
This plot is ready for prime time – it’s pretty AND informative. Make sure that all your text is legible for the context in which it will be viewed.
Modifying whitespace
Whitespace means all the non-visible margins and spacing in the plot.
To set a single whitespace value, use unit(x, unit), where x is the amount and unit is the unit of measure.
Borders require you to set 4 positions, so use margin(top, right, bottom, left, unit). To remember the margin order, think TRouBLe.
The default unit is "pt" (points), which scales well with text. Other options include "cm", "in" (inches) and "lines" (of text).
plt_mpg_vs_wt_by_cyl is available. The panel and legend are wrapped in blue boxes so you can see how they change.
# View the original plot
plt_mpg_vs_wt_by_cyl
plt_mpg_vs_wt_by_cyl +
theme(
# Set the axis tick length to 2 lines
axis.ticks.length = unit(2, "lines")
)
plt_mpg_vs_wt_by_cyl +
theme(
# Set the legend key size to 3 centimeters
legend.key.size = unit(3, "cm")
)
plt_mpg_vs_wt_by_cyl +
theme(
# Set the legend margin to (20, 30, 40, 50) points
legend.margin = margin(20, 30, 40, 50, "pt")
)
plt_mpg_vs_wt_by_cyl +
theme(
# Set the plot margin to (10, 30, 50, 70) millimeters
plot.margin = margin(10, 30, 50, 70, "mm")
)
Wonderful whitespace widening! Changing the whitespace can be useful if you need to make your plot more compact, or if you want to create more space to reduce “busyness”.
Theme flexibility
Ways to use themes
Defining theme objects
Built-in themes
In addition to making your own themes, there are several out-of-the-box solutions that may save you lots of time.
plt_prop_unemployed_over_time is available.
# Add a black and white theme
plt_prop_unemployed_over_time +
theme_bw()
# Add a classic theme
plt_prop_unemployed_over_time +
theme_classic()
# Add a void theme
plt_prop_unemployed_over_time +
theme_void()
The black and white theme works really well if you use transparency in your plot.
Exploring ggthemes
Outside of ggplot2, another source of built-in themes is the ggthemes package. The workspace already contains plt_prop_unemployed_over_time, the line plot from before. Let’s explore some of the ready-made ggthemes themes.
# Load ggthemes, which provides the themes used below
library(ggthemes)
# Use the fivethirtyeight theme
plt_prop_unemployed_over_time +
theme_fivethirtyeight()
# Use Tufte's theme
plt_prop_unemployed_over_time +
theme_tufte()
# Use the Wall Street Journal theme
plt_prop_unemployed_over_time +
theme_wsj()
ggthemes has over 20 themes for you to try.
Setting themes
Reusing a theme across many plots helps to provide a consistent style. You have several options for this.
A good strategy that you’ll use here is to begin with a built-in theme then modify it.
plt_prop_unemployed_over_time is available. The theme you made earlier is shown in the sample code.
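The strategy of starting from a built-in theme and then modifying it can be sketched on its own with the mtcars data. This is only an illustration under assumed names (theme_custom, old_theme and p are not course objects); it also shows that theme_set() hands back the previous default so it can be restored afterwards.

```r
library(ggplot2)

# Start from a built-in theme, then layer modifications on top of it
theme_custom <- theme_bw() +
  theme(
    panel.grid = element_blank(),
    axis.ticks = element_blank()
  )

# theme_set() invisibly returns the previous default theme
old_theme <- theme_set(theme_custom)

# Plots drawn from now on pick up theme_custom without adding it explicitly
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point()
p

# Restore the earlier default once the styled plots are done
theme_set(old_theme)
```

Saving the old theme is optional, but it keeps a script from silently changing the look of every later plot in the same session.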
# Theme layer saved as an object, theme_recession
theme_recession <- theme(
rect = element_rect(fill = "grey92"),
legend.key = element_rect(color = NA),
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.grid.major.y = element_line(color = "white", size = 0.5, linetype = "dotted"),
axis.text = element_text(color = "grey25"),
plot.title = element_text(face = "italic", size = 16),
legend.position = c(0.6, 0.1)
)
# Combine the Tufte theme with theme_recession
theme_tufte_recession <- theme_tufte() + theme_recession
# Add the recession theme to the plot
plt_prop_unemployed_over_time + theme_tufte_recession
theme_recession <- theme(
rect = element_rect(fill = "grey92"),
legend.key = element_rect(color = NA),
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.grid.major.y = element_line(color = "white", size = 0.5, linetype = "dotted"),
axis.text = element_text(color = "grey25"),
plot.title = element_text(face = "italic", size = 16),
legend.position = c(0.6, 0.1)
)
theme_tufte_recession <- theme_tufte() + theme_recession
# Set theme_tufte_recession as the default theme
theme_set(theme_tufte_recession)
# Draw the plot (without explicitly adding a theme)
plt_prop_unemployed_over_time
Publication-quality plots
We’ve seen many examples of beautiful, publication-quality plots. Let’s take a final look and put all the pieces together.
plt_prop_unemployed_over_time is available.
plt_prop_unemployed_over_time +
# Add Tufte's theme
theme_tufte()
plt_prop_unemployed_over_time +
theme_tufte() +
# Add individual theme elements
theme(
# Turn off the legend
legend.position = "none",
# Turn off the axis ticks
axis.ticks = element_blank()
)
plt_prop_unemployed_over_time +
theme_tufte() +
theme(
legend.position = "none",
axis.ticks = element_blank(),
# Set the axis title's text color to grey60
axis.title = element_text(color = "grey60"),
# Set the axis text's text color to grey60
axis.text = element_text(color = "grey60")
)
plt_prop_unemployed_over_time +
theme_tufte() +
theme(
legend.position = "none",
axis.ticks = element_blank(),
axis.title = element_text(color = "grey60"),
axis.text = element_text(color = "grey60"),
# Set the panel gridlines major y values
panel.grid.major.y = element_line(
# Set the color to grey60
color = "grey60",
# Set the size to 0.25
size = 0.25,
# Set the linetype to dotted
linetype = "dotted"
)
)
That will look great in a publication!
Effective explanatory plots
We take the top ten and bottom ten observations from the data set.
Using geoms for explanatory plots
Let’s focus on producing beautiful and effective explanatory plots. In the next couple of exercises, you’ll create a plot that is similar to the one shown in the video using gm2007, a filtered subset of the gapminder dataset.
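The course supplies gm2007 ready-made, so the exact preparation is not shown. One plausible way to build a comparable subset, assuming the gapminder and dplyr packages are installed and that the filter is on the year 2007, is sketched here; treat every step as an assumption rather than the course's own code.

```r
library(dplyr)
library(gapminder)

# All countries in 2007 (assumed to correspond to gm2007_full)
gm2007_full <- gapminder %>%
  filter(year == 2007) %>%
  select(country, lifeExp, continent)

# Keep the ten highest and ten lowest life expectancies
gm2007 <- bind_rows(
  slice_max(gm2007_full, lifeExp, n = 10, with_ties = FALSE),
  slice_min(gm2007_full, lifeExp, n = 10, with_ties = FALSE)
) %>%
  arrange(lifeExp) %>%
  # Fix the factor order so countries plot from lowest to highest
  mutate(country = factor(country, levels = as.character(country)))
```

Ordering the country factor by lifeExp is what makes the points line up from lowest to highest on the y-axis instead of alphabetically.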
This type of plot will be in an info-viz style, meaning that it would be similar to something you’d see in a magazine or website for a mostly lay audience.
A scatterplot of lifeExp by country, colored by lifeExp, with points of size 4, is provided.
# Add a geom_segment() layer
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
geom_point(size = 4) +
geom_segment(aes(xend = 30, yend = country), size = 2)
# Add a geom_text() layer
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
geom_point(size = 4) +
geom_segment(aes(xend = 30, yend = country), size = 2) +
geom_text(aes(label = lifeExp), color = "white", size = 1.5)
# Set the color scale (brewer.pal() comes from the RColorBrewer package)
library(RColorBrewer)
palette <- brewer.pal(5, "RdYlBu")[-(2:4)]
# Modify the scales
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
geom_point(size = 4) +
geom_segment(aes(xend = 30, yend = country), size = 2) +
geom_text(aes(label = round(lifeExp,1)), color = "white", size = 1.5) +
scale_x_continuous("", expand = c(0, 0), limits = c(30, 90), position = "top") +
scale_color_gradientn(colors = palette)
# Add a title and caption
ggplot(gm2007, aes(x = lifeExp, y = country, color = lifeExp)) +
geom_point(size = 4) +
geom_segment(aes(xend = 30, yend = country), size = 2) +
geom_text(aes(label = round(lifeExp,1)), color = "white", size = 1.5) +
scale_x_continuous("", expand = c(0,0), limits = c(30,90), position = "top") +
scale_color_gradientn(colors = palette) +
labs(title = "Highest and lowest life expectancies, 2007", caption = "Source: gapminder")
Let’s continue adding to this plot in the next exercise.
Using annotate() for embellishments
In the previous exercise, we completed our basic plot. Now let’s polish it by playing with the theme and adding annotations. In this exercise, you’ll use annotate() to add text and a curve to the plot.
The following values have been calculated for you to assist with adding embellishments to the plot:
global_mean <- mean(gm2007_full$lifeExp)
x_start <- global_mean + 4
y_start <- 5.5
x_end <- global_mean
y_end <- 7.5
Our previous plot has been assigned to plt_country_vs_lifeExp.
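# Define the theme and save it as step_1_themes, which the later steps reuse
step_1_themes <- theme_classic() +
theme(axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text = element_text(color = "black"),
axis.title = element_blank(),
legend.position = "none")
# Apply the theme to the plot
plt_country_vs_lifeExp +
step_1_themes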
# Add a vertical line
plt_country_vs_lifeExp +
step_1_themes +
geom_vline(xintercept = global_mean, color = "grey40", linetype = 3)
# Add text, saving the annotation as step_3_annotation for reuse in the next step
step_3_annotation <- annotate(
"text",
x = x_start, y = y_start,
label = "The\nglobal\naverage",
vjust = 1, size = 3, color = "grey40"
)
plt_country_vs_lifeExp +
step_1_themes +
geom_vline(xintercept = global_mean, color = "grey40", linetype = 3) +
step_3_annotation
# Add a curve
plt_country_vs_lifeExp +
step_1_themes +
geom_vline(xintercept = global_mean, color = "grey40", linetype = 3) +
step_3_annotation +
annotate(
"curve",
x = x_start, y = y_start,
xend = x_end, yend = y_end,
arrow = arrow(length = unit(0.2, "cm"), type = "closed"),
color = "grey40"
)
Your explanatory plot clearly shows the countries with the highest and lowest life expectancy and would be great for a lay audience.
Congratulations on finishing the course! Continue learning more in Intermediate Data Visualization with ggplot2!