Introduction to R for Finance
Lore Dirick - DataCamp
Course Description
In this finance-oriented introduction to R, you will learn essential data structures such as lists and data frames and have the chance to apply that knowledge to real-world financial examples. By the end of the course, you will be comfortable with the basics of manipulating your data to perform financial analysis in R.
1 The Basics
Get comfortable with the very basics of R and learn how to use it as a calculator. Also, create your first variables in R and explore some of the base data types such as numerics and characters.
1.1 Welcome to Introduction to R for Finance!
1.1.1 Your first R script
Welcome! In the script to the right you will type R code to solve the exercises. When you hit the Submit Answer button, every line of code in the script is executed by R and you get a message that indicates whether or not your code was correct. The output of your submission is shown in the R console.
You can also execute code directly in the R Console. When you type in the console, your submission will not be checked for correctness! Try, for example, typing 3 + 4 and hitting Enter. R should return [1] 7.
# Addition!
3 + 5
## [1] 8
# Subtraction!
6 - 4
## [1] 2
Note the # symbol in the script! This denotes a comment in your code. Comments are a great way to document your code, and are not run when you submit your answer.
One down, many more to go! Great job!
1.1.2 Arithmetic in R (1)
Let’s play around with your new calculator. First, check out these arithmetic operators; most of them should look familiar:
- Addition: +
- Subtraction: -
- Multiplication: *
- Division: /
- Exponentiation: ^ or **
- Modulo: %%
You might be unfamiliar with the last two. The ^ operator raises the number to its left to the power of the number to its right. For example, 3^2 is 9. The modulo operator returns the remainder of the division of the number on the left by the number on the right; for example, 5 modulo 3, or 5 %% 3, is 2.
Lastly, there is another useful way to execute your code besides typing in the R Console or pressing Submit Answer. Clicking on a line of code in the script, and then pressing Command + Enter, will execute just that line in the R Console. Try it out with the 2 + 2 line already in the script!
# Addition
2 + 2
## [1] 4
# Subtraction
4 - 1
## [1] 3
# Multiplication
3 * 4
## [1] 12
Type 4 / 2 in the script to perform division.
# Division
4 / 2
## [1] 2
Type 2^4 to raise 2 to the power of 4.
# Exponentiation
2^4
## [1] 16
Type 7 %% 3 to calculate 7 modulo 3.
# Modulo
7 %% 3
## [1] 1
You’re crushing it!
1.1.3 Arithmetic in R (2)
The order in which you perform your mathematical operations is critical to getting the correct answer. The correct “order of operations” is:
Parentheses, Exponentiation, Multiplication and Division, Addition and Subtraction
Or PEMDAS for short!
This means that when you come across the expression 20 - 8 * 2, you know to do the multiplication first, then the subtraction, to get the correct answer of 4.
Which of these expressions would evaluate to 6?
- 4 + 8 / 2
- (14 - 2) / 2
- (2^3 * 2) / 4
- 6 - 3 * 2
That’s it! Isn’t PEMDAS great?
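To see the order of operations at work, here is a minimal sketch you could run in the console. The variable names are illustrative only and are not part of the course script.

```r
# Multiplication is evaluated before subtraction
no_parens <- 20 - 8 * 2       # 4

# Parentheses override the default order
with_parens <- (20 - 8) * 2   # 24

# Exponentiation is evaluated before multiplication
power_first <- 2 * 3^2        # 18

# Exponentiation also binds tighter than a leading minus sign
neg_power <- -2^2             # -4
pos_power <- (-2)^2           # 4
```

When in doubt, adding parentheses makes the intended order explicit and costs nothing.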
1.1.4 Assignment and variables (1)
It looks like you’re becoming an expert at using R as a calculator! Time to take it one step further. These numbers you are calculating haven’t been very descriptive. 5? 5 what? 5 apples? 5 monkeys? What if you could assign that 5 a descriptive name like number_of_apples, and then simply type that name whenever you want to use 5? Enter variables.
A variable allows you to store a value or an object in R. You can then later use this variable’s name to easily access the value or the object that is stored within this variable. You use <- to assign a variable:
my_money <- 100
Assign 200 to the savings variable in the script.
# Assign 200 to savings
savings <- 200
Typing savings in the script asks R to print the value to the console!
# Print the value of savings to the console
savings
## [1] 200
Excellent job creating a meaningfully named variable and printing it to the console (which, in R, you can do without any call to a print function)!
1.1.5 Assignment and variables (2)
Suppose you have $100 stored in my_money, and your friend Dan has $200. To be clear, you decide to give Dan’s money a variable name too. You want to know how much money the two of you have together. Now that each variable has a descriptive name, this is easy using the arithmetic you learned earlier:
my_money + dans_money
my_money has been defined for you.
# Assign 100 to my_money
my_money <- 100
# Assign 200 to dans_money
dans_money <- 200
# Add my_money and dans_money
my_money + dans_money
## [1] 300
# Add my_money and dans_money again, save the result to our_money
our_money <- my_money + dans_money
Now you’re getting it!!
1.2 Financial returns
1.2.1 Financial returns (1)
Time for some application! Earlier, Lore taught you about financial returns. Now, it’s time for you to put that knowledge to work! But first, a quick review.
Assume you have $100. During January, you make a 5% return on that money. How much do you have at the end of January? Well, you have 100% of your starting money, plus another 5%: 100% + 5% = 105%. In decimals, this is 1 + .05 = 1.05. This 1.05 is the return multiplier for January, and you multiply your original $100 by it to get the amount you have at the end of January.
105 = 100 * 1.05
Or in terms of variables:
post_jan_cash <- starting_cash * jan_mult
A quick way to get the multiplier is:
multiplier = 1 + (return / 100)
# Variables for starting_cash and 5% return during January
starting_cash <- 200
jan_ret <- 5
jan_mult <- 1 + (jan_ret / 100)
Multiply starting_cash by jan_mult to calculate post_jan_cash.
# How much money do you have at the end of January?
post_jan_cash <- starting_cash * jan_mult
Print post_jan_cash.
# Print post_jan_cash
post_jan_cash
## [1] 210
Now assume a 10% return instead and calculate the multiplier jan_mult_10.
# January 10% return multiplier
jan_ret_10 <- 10
jan_mult_10 <- 1 + (jan_ret_10 / 100)
Calculate post_jan_cash_10 using the new multiplier!
# How much money do you have at the end of January now?
post_jan_cash_10 <- starting_cash * jan_mult_10
Print post_jan_cash_10 to see the impact of different interest rates!
# Print post_jan_cash_10
post_jan_cash_10
## [1] 220
Great! Wouldn’t it be nice to always have 10% returns?
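Since the multiplier formula is the same for every return, it can be wrapped in a small helper. The sketch below assumes a hypothetical function name, ret_to_mult(), which is not part of the course; functions themselves are covered later, so treat this as a preview rather than the course's method.

```r
# Hypothetical helper: convert a percentage return into a multiplier
ret_to_mult <- function(ret_pct) {
  1 + ret_pct / 100
}

starting_cash <- 200

# Same starting cash under three different January returns
post_5   <- starting_cash * ret_to_mult(5)    # 5% gain
post_10  <- starting_cash * ret_to_mult(10)   # 10% gain
post_neg <- starting_cash * ret_to_mult(-5)   # 5% loss: multiplier below 1
```

A negative return gives a multiplier below 1, so the same formula handles losses without any special casing.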
1.2.2 Financial returns (2)
Let’s make you some more money. If, in February, you earn another 2% on your cash, how would you calculate the total amount at the end of February? You already know that the amount at the end of January is $100 * 1.05 = $105. To get from the end of January to the end of February, just use another multiplier!
$105 * 1.02 = $107.1
Which is equivalent to:
$100 * 1.05 * 1.02 = $107.1
In this last form, you see the effect of both multipliers on your original $100. In fact, this form can help you find the total return over both months. The correct way to do this is by multiplying the two multipliers together: 1.05 * 1.02 = 1.071. This means you earned 7.1% in total over the 2 month period.
# Starting cash and returns
starting_cash <- 200
jan_ret <- 4
feb_ret <- 5
Calculate the multipliers jan_mult and feb_mult.
# Multipliers
jan_mult <- 1 + (jan_ret / 100)
feb_mult <- 1 + (feb_ret / 100)
Multiply starting_cash by both multipliers to find your total_cash at the end of the two months.
# Total cash at the end of the two months
total_cash <- starting_cash * jan_mult * feb_mult
Print total_cash to see how your money has grown!
# Print total_cash
total_cash
## [1] 218.4
Fantastic! It feels good to make some money.
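As a quick check on why multipliers are multiplied rather than returns added, the sketch below recovers the total two-month return from the exercise's 4% and 5% returns. The names total_mult, total_ret, and naive_sum are illustrative assumptions.

```r
jan_mult <- 1 + 4 / 100
feb_mult <- 1 + 5 / 100

# Compound the two months by multiplying the multipliers
total_mult <- jan_mult * feb_mult

# Convert the combined multiplier back into a percentage return
total_ret <- (total_mult - 1) * 100   # about 9.2

# Simply adding the returns understates the result
naive_sum <- 4 + 5                    # 9
```

The gap between 9.2 and 9 is the return earned in February on January's gains.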
1.3 Basic data types
1.3.1 Data type exploration
To get started, here are some of R’s most basic data types:
- Numerics are decimal numbers like 4.5. A special type of numeric is an integer, which is a numeric without a decimal piece. Integers must be specified like 4L.
- Logicals are the boolean values TRUE and FALSE. Capital letters are important here; true and false are not valid.
- Characters are text values like "hello world".
Assign 150.45 to apple_stock.
# Apple's stock price is a numeric
apple_stock <- 150.45
Assign "AAA" to credit_rating.
# Bond credit ratings are characters
credit_rating <- "AAA"
Assign either TRUE or FALSE to my_answer; we won’t judge!
# You like the stock market. TRUE or FALSE?
my_answer <- TRUE
Print my_answer!
# Print my_answer
my_answer
## [1] TRUE
Great job!
1.3.2 What’s that data type?
Up until now, you have been determining what data type a variable is just by looks. There is actually a better way to check this: the class() function.
This will return the data type (or class) of whatever variable you pass in.
The variables a, b, and c have already been defined for you. You can type ls() in the console at any time to “list” the variables currently available to you. Use the console and class() to decide which statement below is correct.
- a is a numeric, b is a character, c is a logical
- a is a logical, b is a numeric, c is a character
- a is a numeric, b is a numeric, c is a logical
- a is a character, b is a character, c is a numeric
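A minimal sketch of class() applied to each basic type follows. The variable names and values here are illustrative and are not the a, b, and c from the exercise.

```r
price  <- 150.45   # a decimal number
shares <- 4L       # the L suffix makes this an integer
rating <- "AAA"    # text in quotes
bought <- TRUE     # a logical value

class(price)    # "numeric"
class(shares)   # "integer"
class(rating)   # "character"
class(bought)   # "logical"

# An integer is still a kind of number
is.numeric(shares)   # TRUE
```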
2 Vectors and Matrices
In this chapter, you will learn all about vectors and matrices using historical stock prices for companies like Apple and IBM. You will then be able to feel confident creating, naming, manipulating, and selecting from vectors and matrices.
2.1 What is a vector?
2.1.1 c()ombine
Now is where things get fun! It is time to create your first vector. Since this is a finance oriented course, it is only appropriate that your first vector be a numeric vector of stock prices. Remember, you create a vector using the combine function, c(), and each element you add is separated by a comma.
For example, this is a vector of Apple’s stock prices from December, 2016:
apple_stock <- c(109.49, 109.90, 109.11, 109.95, 111.03, 112.12)
And this is a character vector of bond credit ratings:
credit_rating <- c("AAA", "AA", "BBB", "BB", "B")
# Another numeric vector
ibm_stock <- c(159.82, 160.02, 159.84)
Create a character vector named finance containing the finance-related words "stocks", "bonds", and "investments", in that order.
# Another character vector
finance <- c("stocks", "bonds", "investments")
Create a logical vector named logic containing TRUE, FALSE, TRUE, in that order.
# A logical vector
logic <- c(TRUE, FALSE, TRUE)
Great job! You will use c() in almost all of the exercises!
2.1.2 Coerce it
It is important to remember that a vector can only be composed of one data type. This means that you cannot have both a numeric and a character in the same vector. If you attempt to do this, the lower ranking type will be coerced into the higher ranking type.
For example: c(1.5, "hello") results in c("1.5", "hello"), where the numeric 1.5 has been coerced into the character data type.
The hierarchy for coercion is:
logical < integer < numeric < character
Logicals are coerced a bit differently depending on what the highest data type is. c(TRUE, 1.5) will return c(1, 1.5), where TRUE is coerced to the numeric 1 (FALSE would be converted to a 0). On the other hand, c(TRUE, "this_char") is converted to c("TRUE", "this_char").
The vectors a, b, and c have been defined for you from the following commands:
a <- c(1L, "I am a character")
b <- c(TRUE, "Hello")
c <- c(FALSE, 2)
Which statement is correct about type conversion?
- a is a character vector, b is a logical vector, c is a numeric vector.
- a is an integer vector, b is a character vector, c is a logical vector.
- a is a character vector, b is a character vector, c is a numeric vector.
Awesome! Just remember, one type per vector!
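The coercion hierarchy can be checked directly with class(). This is a small sketch with illustrative variable names, not the course's own solution code.

```r
# logical + numeric -> numeric (TRUE becomes 1)
lgl_num <- c(TRUE, 1.5)
class(lgl_num)      # "numeric"

# integer + numeric -> numeric
int_num <- c(1L, 2.5)
class(int_num)      # "numeric"

# anything + character -> character
int_chr <- c(1L, "I am a character")
class(int_chr)      # "character"

# Because TRUE is 1 and FALSE is 0, summing a logical vector counts the TRUEs
n_true <- sum(c(TRUE, FALSE, TRUE))   # 2
```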
2.1.3 Vector names()
Let’s return to the example about January and February’s returns. As a refresher, in January you earned a 5% return, and in February, an extra 2% return. Being the savvy data scientist you are, you realize that you can put these returns into a vector! That would look something like this:
ret <- c(5, 2)
This is great! Now all of the returns are in one place. However, you could go one step further by adding names to each return in your vector. You do this using names(). Check this out:
names(ret) <- c("Jan", "Feb")
Printing ret now returns:
Jan Feb
5 2
Pretty cool, right?
# Vectors of 12 months of returns, and month names
ret <- c(5, 2, 3, 7, 8, 3, 5, 9, 1, 4, 6, 3)
months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
Add months as names to ret to create a more descriptive vector.
# Add names to ret
names(ret) <- months
Print ret to see the newly named vector!
# Print out ret to see the new names!
ret
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 5 2 3 7 8 3 5 9 1 4 6 3
Aren’t vectors fun?
2.1.4 Visualize your vector
Time to try something a bit different. So far, you have been programming in the script, and looking at your data by printing it out. For a more informative visualization, try a plot!
For this exercise, you will again be working with some Apple stock data. This time it contains the prices for all of December, 2016.
The plot() function is one of the many ways to create a graph from your data in R. Passing in a vector will add its values to the y-axis of the graph, and on the x-axis will be an index created from the order that your vector is in.
Inside of plot(), you can change the type of your graph using type =. The default is "p" for points, but you can also change it to "l" for line.
apple_stock has already been defined, and everything has been set up for you. Try running the script line-by-line using Command + Enter on Mac or Control + Enter on Windows while clicked on each line.
# Look at the data
apple_stock
## [1] 150.45
# Plot the data as a scatter plot
plot(apple_stock)
# Plot the data as a line graph
plot(apple_stock, type = "l")
Well done!
2.2 Vector manipulation
2.2.1 Weighted average (1)
As a finance professional, there are a number of important calculations that you will have to know. One of these is the weighted average. The weighted average allows you to calculate your portfolio return over a time period. Consider the following example:
Assume you have 40% of your cash in Apple stock, and 60% of your cash in IBM stock. If, in January, Apple earned 5% and IBM earned 7%, what was your total portfolio return?
To calculate this, take the return of each stock in your portfolio, and multiply it by the weight of that stock. Then sum up all of the results. For this example, you would do:
6.2 = 5 * .4 + 7 * .6
Or, in variable terms:
portf_ret <- apple_ret * apple_weight + ibm_ret * ibm_weight
# Weights and returns
micr_ret <- 7
sony_ret <- 9
micr_weight <- .2
sony_weight <- .8
Calculate the portf_ret for this portfolio.
# Portfolio return
portf_ret <- micr_ret * micr_weight + sony_ret * sony_weight
Finance + R = The greatest thing the universe has ever invented.
2.2.2 Weighted average (2)
Wait a minute, Lore taught us a much better way to do this! Remember, R does arithmetic with vectors! Can you take advantage of this fact to calculate the portfolio return more efficiently? Think carefully about the following code:
ret <- c(5, 7)
weight <- c(.4, .6)
ret_X_weight <- ret * weight
sum(ret_X_weight)
[1] 6.2
First, calculate ret * weight, which multiplies each element in the vectors together to create a new vector ret_X_weight. All you need to do then is add up the pieces, so you use sum() to sum up each element in the vector.
Now it’s your turn!
ret and weight for Microsoft and Sony are defined for you again, but this time, in vector form!
# Weights, returns, and company names
ret <- c(7, 9)
weight <- c(.2, .8)
companies <- c("Microsoft", "Sony")
Assign companies as names to the ret and weight vectors.
# Assign company names to your vectors
names(ret) <- companies
names(weight) <- companies
Multiply ret and weight together.
# Multiply the returns and weights together
ret_X_weight <- ret * weight
Print ret_X_weight to see the results.
# Print ret_X_weight
ret_X_weight
## Microsoft Sony
## 1.4 7.2
Use sum() to get the total portf_ret.
# Sum to get the total portfolio return
portf_ret <- sum(ret_X_weight)
Print portf_ret and compare to the last exercise!
# Print portf_ret
portf_ret
## [1] 8.6
See! Financial math isn’t that hard!
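Base R also ships a weighted.mean() function that should give the same answer as the multiply-then-sum approach when the weights add up to 1. The sketch below compares the two; the variable names portf_manual and portf_wm are illustrative assumptions.

```r
ret    <- c(Microsoft = 7, Sony = 9)
weight <- c(Microsoft = 0.2, Sony = 0.8)

# Element-wise multiply, then add up the pieces
portf_manual <- sum(ret * weight)

# weighted.mean() computes sum(x * w) / sum(w)
portf_wm <- weighted.mean(ret, weight)

# The weights sum to 1, so both approaches agree
all.equal(portf_manual, portf_wm)
```

If the weights did not sum to 1, weighted.mean() would rescale them, so the two results would differ.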
2.2.3 Weighted average (3)
Let’s look at an example of recycling. What if you wanted to give equal weight to your Microsoft and Sony stock returns? That is, you want to be invested 50% in Microsoft and 50% in Sony.
ret <- c(7, 9)
weight <- .5
ret_X_weight <- ret * weight
ret_X_weight
[1] 3.5 4.5
ret is a vector of length 2, and weight is a vector of length 1. R reuses the .5 in weight twice to make it the same length as ret, then performs the element-wise arithmetic.
ret, containing the returns of 3 stocks, is in your workspace.
Print ret to see the returns of your 3 stocks.
# Print ret
ret
## Microsoft Sony
## 7 9
Assign 1/3 to weight. This will be the weight that each stock receives.
# Assign 1/3 to weight
weight <- 1/3
Create ret_X_weight by multiplying ret and weight. See how R recycles weight?
# Create ret_X_weight
ret_X_weight <- ret * weight
Use sum() on the ret_X_weight variable to create your equally weighted portf_ret.
# Calculate your portfolio return
portf_ret <- sum(ret_X_weight)
# Vector of length 3 * Vector of length 2?
ret * c(.2, .6)
## Microsoft Sony
## 1.4 5.4
Awesome! Recycling makes multiplying vectors by numbers like .5 easy to understand!
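Recycling is convenient, but it can hide mistakes when the lengths do not line up. The following sketch uses hypothetical return values to show the clean case, a silent case, and the case where R warns.

```r
ret4 <- c(7, 9, 11, 4)   # hypothetical returns for four stocks

# Length 1 is recycled to every element
halved <- ret4 * 0.5            # 3.5 4.5 5.5 2.0

# Length 2 divides length 4 evenly, so R recycles silently: 1, 2, 1, 2
silent <- ret4 * c(1, 2)        # 7 18 11 8

# Length 2 does not divide length 3 evenly, so R issues a warning.
# tryCatch() captures the warning message here instead of printing it.
msg <- tryCatch(
  c(1, 2, 3) * c(.2, .6),
  warning = function(w) conditionMessage(w)
)
msg
```

The silent case is the one to watch: R gives no hint that the second vector was shorter than intended.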
2.2.4 Vector subsetting
Sometimes, you will only want to use specific pieces of your vectors, and you’ll need some way to access just those parts. For example, what if you only wanted the first month of returns from the vector of 12 months of returns? To solve this, you can subset the vector using [ ].
Here is the 12 month return vector:
ret <- c(5, 2, 3, 7, 8, 3, 5, 9, 1, 4, 6, 3)
Select the first month: ret[1].
Select the first month by name: ret["Jan"].
Select the first three months: ret[1:3] or ret[c(1, 2, 3)].
ret is defined in your workspace.
# First 6 months of returns
ret[1:6]
## Jan Feb Mar Apr May Jun
## 5 2 3 7 8 3
Use c() with "Mar" and "May" to select just March and May.
# Just March and May
ret[c("Mar","May")]
## Mar May
## 3 8
# Omit the first month of returns
ret[-1]
## Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2 3 7 8 3 5 9 1 4 6 3
Well done!
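The subsetting forms above can be tried on a self-contained copy of the 12 month vector. This sketch uses month.abb, a constant built into R holding "Jan" through "Dec", as a shortcut for the names; the result variable names are illustrative.

```r
ret <- c(5, 2, 3, 7, 8, 3, 5, 9, 1, 4, 6, 3)
names(ret) <- month.abb   # built-in vector "Jan", "Feb", ..., "Dec"

first_half <- ret[1:6]               # by position
mar_may    <- ret[c("Mar", "May")]   # by name
no_jan     <- ret[-1]                # negative index drops an element

# Asking for a name that does not exist returns NA rather than an error
missing_name <- ret["Q1"]
is.na(missing_name)
```

Seeing NA in a subset is therefore a hint that the names on the vector are not what you expected.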
2.3 Matrix - a 2D vector
2.3.1 Create a matrix!
Matrices are similar to vectors, except they are in 2 dimensions! Let’s create a 2x2 matrix “by hand” using matrix().
matrix(data = c(2, 3, 4, 5), nrow = 2, ncol = 2)
[,1] [,2]
[1,] 2 4
[2,] 3 5
Notice that the actual data for the matrix is passed in as a vector using c(), and is then converted to a matrix by specifying the number of rows and columns (also known as the dimensions).
Because the matrix is just created from a vector, the following is equivalent to the above code.
my_vector <- c(2, 3, 4, 5)
matrix(data = my_vector, nrow = 2, ncol = 2)
my_vector has been defined for you.
# A vector of 9 numbers
my_vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
Use matrix() to create a 3x3 matrix from my_vector.
# 3x3 matrix
my_matrix <- matrix(data = my_vector, nrow = 3, ncol = 3)
Print my_matrix.
# Print my_matrix
my_matrix
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
Fill the matrix across rows instead of down columns by setting byrow = TRUE. Compare this to the example given above.
# Filling across using byrow = TRUE
matrix(data = c(2, 3, 4, 5), nrow = 2, ncol = 2, byrow = TRUE)
## [,1] [,2]
## [1,] 2 3
## [2,] 4 5
Awesome! You just created your first matrix! You’re becoming a true R wizard.
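To make the difference between the two fill orders concrete, here is a small sketch on a 2x3 matrix. The names by_col and by_row are illustrative.

```r
# Default: fill down each column first
by_col <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE: fill across each row first
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

dim(by_col)    # 2 3  (rows, then columns)
by_col[1, ]    # 1 3 5
by_row[1, ]    # 1 2 3
```

Both matrices hold the same six numbers; only their arrangement differs.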
2.3.2 Matrix <- bind vectors
Often, you won’t be creating matrices like we did in the last example. Instead, you will create them from multiple vectors that you want to combine together. For this, it is easiest to use the functions cbind() and rbind() (column bind and row bind respectively). To see these in action, let’s combine two vectors of Apple and IBM stock prices:
apple <- c(109.49, 109.90, 109.11, 109.95, 111.03)
ibm <- c(159.82, 160.02, 159.84, 160.35, 164.79)
cbind(apple, ibm)
apple ibm
[1,] 109.49 159.82
[2,] 109.90 160.02
[3,] 109.11 159.84
[4,] 109.95 160.35
[5,] 111.03 164.79
rbind(apple, ibm)
[,1] [,2] [,3] [,4] [,5]
apple 109.49 109.90 109.11 109.95 111.03
ibm 159.82 160.02 159.84 160.35 164.79
Now it’s your turn!
# edited by cliex159
apple <- c(109.49, 109.90, 109.11, 109.95, 111.03, 112.12, 113.95, 113.30, 115.19, 115.19, 115.82, 115.97, 116.64, 116.95, 117.06, 116.29, 116.52, 117.26, 116.76, 116.73, 115.82)
ibm <- c(159.82, 160.02, 159.84, 160.35, 164.79, 165.36, 166.52, 165.50, 168.29, 168.51, 168.02, 166.73, 166.68, 167.60, 167.33, 167.06, 166.71, 167.14, 166.19, 166.60, 165.99)
micr <- c(59.20, 59.25, 60.22, 59.95, 61.37, 61.01, 61.97, 62.17, 62.98, 62.68, 62.58, 62.30, 63.62, 63.54, 63.54, 63.55, 63.24, 63.28, 62.99, 62.90, 62.14)
The apple, ibm, and micr stock price vectors from December, 2016 are in your workspace.
Use cbind() to column bind apple, ibm, and micr together, in that order, as cbind_stocks.
# cbind the vectors together
cbind_stocks <- cbind(apple, ibm, micr)
Print cbind_stocks.
# Print cbind_stocks
cbind_stocks
## apple ibm micr
## [1,] 109.49 159.82 59.20
## [2,] 109.90 160.02 59.25
## [3,] 109.11 159.84 60.22
## [4,] 109.95 160.35 59.95
## [5,] 111.03 164.79 61.37
## [6,] 112.12 165.36 61.01
## [7,] 113.95 166.52 61.97
## [8,] 113.30 165.50 62.17
## [9,] 115.19 168.29 62.98
## [10,] 115.19 168.51 62.68
## [11,] 115.82 168.02 62.58
## [12,] 115.97 166.73 62.30
## [13,] 116.64 166.68 63.62
## [14,] 116.95 167.60 63.54
## [15,] 117.06 167.33 63.54
## [16,] 116.29 167.06 63.55
## [17,] 116.52 166.71 63.24
## [18,] 117.26 167.14 63.28
## [19,] 116.76 166.19 62.99
## [20,] 116.73 166.60 62.90
## [21,] 115.82 165.99 62.14
Use rbind() to row bind the three vectors together, in the same order, as rbind_stocks.
# rbind the vectors together
rbind_stocks <- rbind(apple, ibm, micr)
Print rbind_stocks.
# Print rbind_stocks
rbind_stocks
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## apple 109.49 109.90 109.11 109.95 111.03 112.12 113.95 113.30 115.19 115.19
## ibm 159.82 160.02 159.84 160.35 164.79 165.36 166.52 165.50 168.29 168.51
## micr 59.20 59.25 60.22 59.95 61.37 61.01 61.97 62.17 62.98 62.68
## [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## apple 115.82 115.97 116.64 116.95 117.06 116.29 116.52 117.26 116.76 116.73
## ibm 168.02 166.73 166.68 167.60 167.33 167.06 166.71 167.14 166.19 166.60
## micr 62.58 62.30 63.62 63.54 63.54 63.55 63.24 63.28 62.99 62.90
## [,21]
## apple 115.82
## ibm 165.99
## micr 62.14
The functions cbind() and rbind() are pretty common. They also work with data frames, which you’ll learn about in the next chapter!
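The two binding functions produce the same numbers in different orientations. The sketch below uses only the first three prices of each stock to keep the output small; by_cols and by_rows are illustrative names.

```r
apple <- c(109.49, 109.90, 109.11)
ibm   <- c(159.82, 160.02, 159.84)

by_cols <- cbind(apple, ibm)   # each vector becomes a column: 3 rows, 2 columns
by_rows <- rbind(apple, ibm)   # each vector becomes a row: 2 rows, 3 columns

dim(by_cols)        # 3 2
dim(by_rows)        # 2 3
colnames(by_cols)   # "apple" "ibm"
rownames(by_rows)   # "apple" "ibm"

# Transposing one orientation gives the values of the other
all(t(by_cols) == by_rows)
```

Note that the variable names carry over as column names with cbind() and as row names with rbind().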
2.3.3 Visualize your matrix
Similar to vectors, we can visualize our matrix to gain some insights about the relationships in the data.
In this exercise, you will plot the matrix of Apple and Microsoft stock prices to see the relationship between the two companies’ stock prices during December, 2016.
# edited by cliex159
apple_micr_matrix <- cbind(apple, micr)
apple_micr_matrix is available in your workspace.
Print apple_micr_matrix to get a look at the data.
# View the data
apple_micr_matrix
## apple micr
## [1,] 109.49 59.20
## [2,] 109.90 59.25
## [3,] 109.11 60.22
## [4,] 109.95 59.95
## [5,] 111.03 61.37
## [6,] 112.12 61.01
## [7,] 113.95 61.97
## [8,] 113.30 62.17
## [9,] 115.19 62.98
## [10,] 115.19 62.68
## [11,] 115.82 62.58
## [12,] 115.97 62.30
## [13,] 116.64 63.62
## [14,] 116.95 63.54
## [15,] 117.06 63.54
## [16,] 116.29 63.55
## [17,] 116.52 63.24
## [18,] 117.26 63.28
## [19,] 116.76 62.99
## [20,] 116.73 62.90
## [21,] 115.82 62.14
Use plot() to create a scatter plot of Microsoft vs Apple stock prices.
# Scatter plot of Microsoft vs Apple
plot(apple_micr_matrix)
Visualizations are critical to help you understand your data!
2.3.4 cor()relation
Did you notice the relationship between the two stocks? It seems that when Apple’s stock moves up, Microsoft’s does as well. One way to capture this kind of relationship is by finding the correlation between the two stocks. Correlation is a measure of association between two things, here, stock prices, and is represented by a number from -1 to 1. A 1 represents perfect positive correlation, a -1 represents perfect negative correlation, and a correlation of 0 means that there is no linear relationship between the movements of the two stocks. Correlation is a common metric in finance, and it is useful to know how to calculate it in R.
The cor() function will calculate the correlation between two vectors, or will create a correlation matrix when given a matrix.
cor(apple, micr)
[1] 0.9477011
cor(apple_micr_matrix)
apple micr
apple 1.0000000 0.9477011
micr 0.9477011 1.0000000
cor(apple, micr) simply returned the correlation between the two stocks. A large correlation of .9477 hints that Apple and Microsoft’s stock prices move closely together. cor(apple_micr_matrix) returned a matrix that shows all of the possible pairwise correlations. The top left correlation of 1 is the correlation of Apple with itself, which makes sense!
apple, micr, and ibm are in your workspace.
Calculate the correlation between apple and ibm.
# Correlation of Apple and IBM
cor(apple, ibm)
## [1] 0.8872467
Create a matrix of apple, micr, and ibm, in that order, named stocks using cbind().
# stock matrix
stocks <- cbind(apple, micr, ibm)
Calculate the correlation of the stocks matrix instead. Correlation matrices are very powerful when you have many stocks!
# cor() of all three
cor(stocks)
## apple micr ibm
## apple 1.0000000 0.9477010 0.8872467
## micr 0.9477010 1.0000000 0.9126597
## ibm 0.8872467 0.9126597 1.0000000
Great job! Correlations are a popular topic in finance. Ask Google for more information if you are interested.
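For readers curious what cor() computes by default, the Pearson correlation is the covariance of the two series divided by the product of their standard deviations. The sketch below checks this on two short hypothetical price vectors, x and y, which are made up for illustration and are not course data.

```r
x <- c(1, 2, 3, 4, 5)   # hypothetical prices
y <- c(2, 4, 5, 4, 6)   # hypothetical prices

# Built-in correlation (Pearson by default)
r_builtin <- cor(x, y)

# The same quantity from its definition
r_manual <- cov(x, y) / (sd(x) * sd(y))

# A series is perfectly negatively correlated with its own negation
r_opposite <- cor(x, -x)

# A correlation matrix always has 1s on its diagonal
cor_mat <- cor(cbind(x, y))
diag(cor_mat)
```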
2.3.5 Matrix subsetting
Just like vectors, matrices can be selected from and subsetted! To do this, you will again use [ ], but this time it will have two inputs. The basic structure is:
my_matrix[row, col]
Then:
To select the first row and first column of stocks from the last example: stocks[1, 1]
To select the entire first row, leave the col empty: stocks[1, ]
To select the first two rows: stocks[1:2, ] or stocks[c(1, 2), ]
To select an entire column, leave the row empty: stocks[, 1]
You can also select an entire column by name: stocks[, "apple"]
stocks is in your workspace.
Select the third row of stocks.
# Third row
stocks[3, ]
## apple micr ibm
## 109.11 60.22 159.84
Select the fourth and fifth row of the ibm column of stocks.
# Fourth and fifth row of the ibm column
stocks[4:5, "ibm"]
## [1] 160.35 164.79
Select the apple and micr columns from stocks using c() inside the brackets.
# apple and micr columns
stocks[ , c("apple","micr")]
## apple micr
## [1,] 109.49 59.20
## [2,] 109.90 59.25
## [3,] 109.11 60.22
## [4,] 109.95 59.95
## [5,] 111.03 61.37
## [6,] 112.12 61.01
## [7,] 113.95 61.97
## [8,] 113.30 62.17
## [9,] 115.19 62.98
## [10,] 115.19 62.68
## [11,] 115.82 62.58
## [12,] 115.97 62.30
## [13,] 116.64 63.62
## [14,] 116.95 63.54
## [15,] 117.06 63.54
## [16,] 116.29 63.55
## [17,] 116.52 63.24
## [18,] 117.26 63.28
## [19,] 116.76 62.99
## [20,] 116.73 62.90
## [21,] 115.82 62.14
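One detail worth knowing about matrix subsetting: selecting a single row or column returns a plain vector by default. The sketch below, built on the first three rows of the stock data, shows this and the drop = FALSE argument that keeps the matrix shape. The name small_stocks is an illustrative assumption.

```r
small_stocks <- cbind(
  apple = c(109.49, 109.90, 109.11),
  micr  = c(59.20, 59.25, 60.22),
  ibm   = c(159.82, 160.02, 159.84)
)

third_row <- small_stocks[3, ]          # named numeric vector
ibm_tail  <- small_stocks[2:3, "ibm"]   # 160.02 159.84

# A single column comes back as a vector, not a matrix
apple_vec <- small_stocks[, "apple"]
is.matrix(apple_vec)                    # FALSE

# drop = FALSE keeps the result as a 3 x 1 matrix
apple_mat <- small_stocks[, "apple", drop = FALSE]
dim(apple_mat)                          # 3 1
```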
3 Data Frames
Arguably the most important data structure in R, the data frame is what most of your data will take the form of. It combines the structure of a matrix with the flexibility of having different types of data in each column.
3.1 What is a data frame?
3.1.1 Create your first data.frame()
Data frames are great because of their ability to hold a different type of data in each column. To get started, let’s use the data.frame() function to create a data frame of your business’s future cash flows. Here are the variables that will be in the data frame:
company - The company that is paying you the cash flow (A or B).
cash_flow - The amount of money you will receive from the company.
year - The number of years from now that you receive the cash flow.
To create the data frame, you do the following:
data.frame(company = c("A", "A", "B"), cash_flow = c(100, 200, 300), year = c(1, 3, 2))
company cash_flow year
1 A 100 1
2 A 200 3
3 B 300 2
Like matrices, data frames are created from vectors, so this code would have also worked:
company <- c("A", "A", "B")
cash_flow <- c(100, 200, 300)
year <- c(1, 3, 2)
data.frame(company, cash_flow, year)
The company, cash_flow, and year variables have been defined for you.
# Variables
company <- c("A", "A", "A", "B", "B", "B", "B")
cash_flow <- c(1000, 4000, 550, 1500, 1100, 750, 6000)
year <- c(1, 3, 4, 1, 2, 4, 5)
Create a data frame from company, cash_flow, and year, in that order. Assign it to cash. You will use this data frame throughout the rest of the chapter!
# Data frame
cash <- data.frame(company, cash_flow, year)
# Print cash
cash
## company cash_flow year
## 1 A 1000 1
## 2 A 4000 3
## 3 A 550 4
## 4 B 1500 1
## 5 B 1100 2
## 6 B 750 4
## 7 B 6000 5
Great job creating your first data frame!
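To confirm that each column of a data frame keeps its own type, you can ask for the class of every column at once. This sketch uses the small three-row example from above under the illustrative name small_cash; stringsAsFactors = FALSE is passed explicitly so the result does not depend on your R version.

```r
small_cash <- data.frame(
  company   = c("A", "A", "B"),
  cash_flow = c(100, 200, 300),
  year      = c(1, 3, 2),
  stringsAsFactors = FALSE   # keep text columns as character in any R version
)

# One class per column
col_types <- sapply(small_cash, class)
col_types

# Rows and columns
dim(small_cash)   # 3 3
```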
3.1.2 What goes in a data frame?
Knowledge test! What kind of vectors can you not create a data frame from?
- Character
- Logical
- Numeric
- Integer
- A data frame can be created from all of these
Awesome! Data frames can be built from vectors of any of the base types!
3.1.3 Making head()s and tail()s of your data with some str()ucture
Time to introduce a few simple, but very useful functions.
head() - Returns the first few rows of a data frame. By default, 6. To change this, use head(cash, n = ___)
tail() - Returns the last few rows of a data frame. By default, 6. To change this, use tail(cash, n = ___)
str() - Check the structure of an object. This fantastic function will show you the data type of the object you pass in (here, data.frame), and will list each column variable along with its data type.
With a small data set such as yours, head() and tail() are not incredibly useful, but imagine if you had a data frame of hundreds or thousands of rows!
Call head() on cash to see the first 4 rows.
# Call head() for the first 4 rows
head(cash, n = 4)
## company cash_flow year
## 1 A 1000 1
## 2 A 4000 3
## 3 A 550 4
## 4 B 1500 1
Call tail() on cash to see the last 3 rows.
# Call tail() for the last 3 rows
tail(cash, n = 3)
## company cash_flow year
## 5 B 1100 2
## 6 B 750 4
## 7 B 6000 5
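Call str() on cash to check out the structure of your data frame. (In versions of R before 4.0.0, data.frame() converts character vectors to factors by default, so there the class of company shows as a Factor and not a character. Do not fear! Factors will be covered in Chapter 4. In R 4.0.0 and later, company stays a character, as in the output below.)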
# Call str()
str(cash)
## 'data.frame': 7 obs. of 3 variables:
## $ company : chr "A" "A" "A" "B" ...
## $ cash_flow: num 1000 4000 550 1500 1100 750 6000
## $ year : num 1 3 4 1 2 4 5
Success!!
3.1.4 Naming your columns / rows
Let’s look at cash again:
cash
comp cash yr
1 A 1000 1
2 A 4000 3
3 A 550 4
4 B 1500 1
5 B 1100 2
6 B 750 4
7 B 6000 5
Wait, that’s not right! It looks like someone has changed your column names! Don’t worry, you can change them back using colnames() just like you did with names() back with vectors.
Similarly, you can change the row names using rownames(), but this is less common.
cash is in your workspace.
Fix your column names by using colnames() and assigning a character vector of "company", "cash_flow", and "year" in that order.
# Fix your column names
colnames(cash) <- c("company", "cash_flow", "year")
Print out the colnames() of cash.
# Print out the column names of cash
colnames(cash)
## [1] "company" "cash_flow" "year"
Fantastic! Your column names are much better.
3.2 Data frame manipulation
3.2.1 Accessing and subsetting data frames (1)
Even more often than with vectors, you are going to want to subset your data frame or access certain columns. Again, one of the ways to do this is to use [ ]. The notation is just like matrices! Here are some examples:
Select the first row: cash[1, ]
Select the first column: cash[, 1]
Select the first column by name: cash[, "company"]
Select the third row, second column of cash.
# Third row, second column
cash[3, 2]
## [1] 550
Select the fifth row of the "year" column of cash.
# Fifth row of the "year" column
cash[5, "year"]
## [1] 2
Great job! Subsetting data frames is a great skill to learn!
3.2.2 Accessing and subsetting data frames (2)
As you might imagine, selecting a specific column from a data frame is a common manipulation. So common, in fact, that it was given its own shortcut, the $. The following return the same answer:
cash$cash_flow
[1] 1000 4000 550 1500 1100 750 6000
cash[, "cash_flow"]
[1] 1000 4000 550 1500 1100 750 6000
Useful right? Try it out!
Select the "year" column from cash using $.
# Select the year column
cash$year
## [1] 1 3 4 1 2 4 5
Select the "cash_flow" column from cash using $ and multiply it by 2.
# Select the cash_flow column and multiply by 2
cash$cash_flow * 2
## [1] 2000 8000 1100 3000 2200 1500 12000
You can delete a column by assigning it NULL. Run the code that deletes "company".
# Delete the company column
cash$company <- NULL
Print cash again.
# Print cash again
cash
## cash_flow year
## 1 1000 1
## 2 4000 3
## 3 550 4
## 4 1500 1
## 5 1100 2
## 6 750 4
## 7 6000 5
The $ is a great shortcut to use with data frames! Learn to love it!
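R offers a few equivalent ways to pull a column out of a data frame. The sketch below compares them on a small stand-in data frame, here named small_cash as an assumption; the [[ ]] and single-bracket forms are standard base R but are not covered in this section of the course.

```r
small_cash <- data.frame(
  company   = c("A", "A", "B"),
  cash_flow = c(100, 200, 300),
  year      = c(1, 3, 2),
  stringsAsFactors = FALSE
)

by_dollar  <- small_cash$cash_flow         # vector
by_bracket <- small_cash[, "cash_flow"]    # vector
by_double  <- small_cash[["cash_flow"]]    # vector

# Single brackets with only a name keep the data frame structure
one_col_df <- small_cash["cash_flow"]
class(one_col_df)                          # "data.frame"

# Assigning NULL removes a column
small_cash$company <- NULL
names(small_cash)                          # "cash_flow" "year"
```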
3.2.3 Accessing and subsetting data frames (3)
Often, just simply selecting a column from a data frame is not all you
want to do. What if you are only interested in the cash flows from
company A? For more flexibility, try subset()
!
subset(cash, company == "A")
company cash_flow year
1 A 1000 1
2 A 4000 3
3 A 550 4
There are a few important things happening here:
The first argument you pass to subset() is the name of your data frame, cash.
Notice that you should not put company in quotes!
The == is the equality operator. It tests to find where two things are equal, and returns a logical vector. There is a lot more to learn about these relational operators, and you can learn all about them in the second finance course, Intermediate R for Finance!
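To see the logical vector that == produces, and how subset() uses it, here is a minimal sketch. It assumes a stand-in for cash rebuilt from the values printed in this chapter; is_A and rows_A are illustrative names.

```r
# Stand-in for the cash data frame, rebuilt from the values printed in this chapter
cash <- data.frame(
  company   = c("A", "A", "A", "B", "B", "B", "B"),
  cash_flow = c(1000, 4000, 550, 1500, 1100, 750, 6000),
  year      = c(1, 3, 4, 1, 2, 4, 5),
  stringsAsFactors = FALSE
)

# == compares every element of the column and returns TRUE or FALSE for each row
is_A <- cash$company == "A"
print(is_A)

# subset() keeps only the rows where the condition is TRUE
rows_A <- subset(cash, company == "A")
print(rows_A)
```

Inside subset(), company refers to the column of cash, which is why it is written without quotes.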
Use subset() to select only the rows of cash corresponding to company B.
# Rows about company B
subset(cash, company == "B")
## cash_flow year
## 4 1500 1
## 5 1100 2
## 6 750 4
## 7 6000 5
Use subset() to select the rows that have cash flows due in 1 year.
# Rows with cash flows due in 1 year
subset(cash, year == 1)
## cash_flow year
## 1 1000 1
## 4 1500 1
Great! subset()
allows you to create more powerful ways to select groups from your data.
3.2.4 Adding new columns
In a perfect world, you could be 100% certain that you will receive all of your cash flows. But, since these are predictions about the future, there is always a chance that someone won’t be able to pay! You decide to run some analysis about a worst case scenario where you only receive half of your expected cash flow. To save the worst case scenario for later analysis, you decide to add it as a new column to the data frame!
cash$half_cash <- cash$cash_flow * .5
cash
company cash_flow year half_cash
1 A 1000 1 500
2 A 4000 3 2000
3 A 550 4 275
4 B 1500 1 750
5 B 1100 2 550
6 B 750 4 375
7 B 6000 5 3000
And that’s it! Creating new columns in your data frame is as simple as assigning the new information to data_frame$new_column. Often, the newly created column is some transformation of existing columns, so the $ operator really comes in handy here!
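Create a new scenario where you only receive a quarter of your expected cash flow, and add it to cash as the column quarter_cash.
# Quarter cash flow scenario
cash$quarter_cash <- cash$cash_flow * .25
What if it takes twice as long (in terms of year) to receive your money? Add a new column double_year with this scenario.
# Double year scenario
cash$double_year <- cash$year * 2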
Great! See how useful the $
is for readability?
3.3 Present value
3.3.1 Present value of projected cash flows (1)
Time for some analysis! Earlier, Lore introduced the idea of present value. You will use that idea in the next two exercises, so here is another example.
If you expect a cash flow of $100 to be received 1 year from now, what is the present value of that cash flow at a 5% interest rate? To calculate this, you discount the cash flow to get it in terms of today’s dollars. The general formula for this is:
present_value <- cash_flow * (1 + interest / 100) ^ -year
95.238 = 100 * (1.05) ^ -1
Another way to think about this is to reverse the problem. If you have $95.238 today, and it earns 5% over the next year, how much money do you have at the end of the year? We know how to do this problem from way back in chapter 1! Find the multiplier that corresponds to 5% and multiply by $95.238!
100 = 95.238 * (1.05)
Aha! To discount your money, just do the reverse of what you did with stock returns in chapter 1.
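The discounting formula and its reverse can be checked directly in R. This is a small sketch using the numbers from the example above; the variable names are illustrative, and the last step reuses the first three cash flows and years printed in this chapter.

```r
# Discount a single cash flow of $100 received in 1 year at 5% interest
cash_flow <- 100
interest  <- 5
year      <- 1
present_value <- cash_flow * (1 + interest / 100) ^ -year
print(round(present_value, 3))

# Reverse the problem: grow the present value for one year at 5%
future_value <- present_value * (1 + interest / 100) ^ year
print(future_value)

# The same formula works element by element on vectors
cash_flows <- c(1000, 4000, 550)
years      <- c(1, 3, 4)
present_values <- cash_flows * (1 + interest / 100) ^ -years
print(present_values)
```

Because ^ is evaluated before multiplication, the discount factor is computed first and then applied to the cash flow.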
Calculate the present value of $4000, received in 3 years, at 5% interest. Assign it to present_value_4k.
# Present value of $4000, in 3 years, at 5%
present_value_4k <- 4000 * (1.05) ^ -3
You can calculate the present value of every cash_flow at once! Use cash$cash_flow, cash$year and the general formula to calculate the present value of all of your cash flows at 5% interest. Add it to cash as the column present_value.
# Present value of all cash flows
cash$present_value <- cash$cash_flow * (1.05) ^ -cash$year
Print cash to see your new column.
# Print out cash
cash
## cash_flow year quarter_cash double_year present_value
## 1 1000 1 250.0 2 952.3810
## 2 4000 3 1000.0 6 3455.3504
## 3 550 4 137.5 8 452.4864
## 4 1500 1 375.0 2 1428.5714
## 5 1100 2 275.0 4 997.7324
## 6 750 4 187.5 8 617.0269
## 7 6000 5 1500.0 10 4701.1570
Great! Learning to calculate present values is useful for any finance calculation.
3.3.2 Present value of projected cash flows (2)
Amazing! You are almost done with this chapter, and you are becoming a true wizard of data frames and finance. Before you move on, let’s answer a few more questions.
You now have a column for present_value, but you want to report the total amount of that column to your board members. Calculating this part is easy: use the sum() function you learned earlier to add up the elements of cash$present_value.
However, you also want to know how much company A and company B individually contribute to the total present value. Do you remember how to separate the rows of your data frame to only include company A or B?
cash_A <- subset(cash, company == "A")
sum(cash_A$present_value)
[1] 4860.218
Use the sum() function to calculate the total present_value of cash. Assign it to total_pv.
# Total present value of cash
total_pv <- sum(cash$present_value)
Subset cash to only include rows about company B to create cash_B.
# Company B information
cash_B <- subset(cash, company == "B")
Use sum() and cash_B to calculate the total present_value from company B. Assign it to total_pv_B.
# Total present value of cash_B
total_pv_B <- sum(cash_B$present_value)
4 Factors
Questions with answers that fall into a limited number of categories can be classified as factors. In this chapter, you will use bond credit ratings to learn all about creating, ordering, and subsetting factors.
4.1 What is a factor?
4.1.1 Create a factor
Bond credit ratings are common in the fixed income side of the finance world as a simple measure of how “risky” a certain bond might be. Here, riskiness can be defined as the probability of default, which means an inability to pay back your debts. The Standard and Poor’s and Fitch credit rating agencies have defined the following ratings, from least likely to default to most likely:
AAA, AA, A, BBB, BB, B, CCC, CC, C, D
This is a perfect example of a factor! It is a categorical variable that takes on a limited number of levels.
To create a factor in R, use the factor()
function, and pass in a vector that you want to be converted into a factor.
Suppose you have a portfolio of 7 bonds with these credit ratings:
credit_rating <- c("AAA", "AA", "A", "BBB", "AA", "BBB", "A")
To create a factor from this:
factor(credit_rating)
[1] AAA AA A BBB AA BBB A
Levels: A AA AAA BBB
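It can help to look at what factor() stores. The sketch below uses the same seven ratings; as.integer() is not covered in this course and is used here only to reveal the codes behind the labels.

```r
credit_rating <- c("AAA", "AA", "A", "BBB", "AA", "BBB", "A")
credit_factor <- factor(credit_rating)

# The levels are the unique values, sorted alphabetically by default
print(levels(credit_factor))

# Internally, each element is stored as the position of its level
print(as.integer(credit_factor))

# The character vector and the factor are different kinds of objects
print(class(credit_rating))
print(class(credit_factor))
```

The alphabetical default is why the levels do not follow the credit rating order, a point the chapter returns to with ordered factors.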
A new character vector, credit_rating
has been created for you in the code for this exercise.
Turn credit_rating into a factor using factor(). Assign it to credit_factor.
# credit_rating character vector
credit_rating <- c("BB", "AAA", "AA", "CCC", "AA", "AAA", "B", "BB")
# Create a factor from credit_rating
credit_factor <- factor(credit_rating)
Print out credit_factor.
# Print out your new factor
credit_factor
## [1] BB AAA AA CCC AA AAA B BB
## Levels: AA AAA B BB CCC
Call str() on credit_rating to note the structure.
# Call str() on credit_rating
str(credit_rating)
## chr [1:8] "BB" "AAA" "AA" "CCC" "AA" "AAA" "B" "BB"
Call str() on credit_factor and compare the structure to credit_rating.
# Call str() on credit_factor
str(credit_factor)
## Factor w/ 5 levels "AA","AAA","B",..: 4 2 1 5 1 2 3 4
Fantastic! That wasn’t too bad, right?
4.1.2 Factor levels
Accessing the unique levels of your factor is simple enough by using the levels()
function. You can also use this to rename your factor levels!
credit_factor
[1] AAA AA A BBB AA BBB A
Levels: A AA AAA BBB
levels(credit_factor)
[1] "A" "AA" "AAA" "BBB"
levels(credit_factor) <- c("1A", "2A", "3A", "3B")
credit_factor
[1] 3A 2A 1A 3B 2A 3B 1A
Levels: 1A 2A 3A 3B
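Renaming with levels() works by position, so each new name replaces the old level in the same slot. A minimal sketch of that mapping, using the example factor above (old_levels and new_levels are illustrative names):

```r
credit_factor <- factor(c("AAA", "AA", "A", "BBB", "AA", "BBB", "A"))

# Line up the old and new names to check the mapping before renaming
old_levels <- levels(credit_factor)
new_levels <- c("1A", "2A", "3A", "3B")
print(data.frame(old_levels, new_levels))

# Assign the new names; every element follows its level
levels(credit_factor) <- new_levels
print(credit_factor)
```

If the new names were supplied in a different order, the bonds would silently be relabeled with the wrong ratings, which is why the level order matters.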
The credit_factor
variable you created in the last exercise is available in your workspace.
Use levels() on credit_factor to identify the unique levels.
# Identify unique levels
levels(credit_factor)
## [1] "AA" "AAA" "B" "BB" "CCC"
Using the same “1A”, “2A” notation as in the example, rename the levels of credit_factor. Pay close attention to the level order!
# Rename the levels of credit_factor
levels(credit_factor) <- c("2A", "3A", "1B", "2B", "3C")
Print credit_factor.
# Print credit_factor
credit_factor
## [1] 2B 3A 2A 3C 2A 3A 1B 2B
## Levels: 2A 3A 1B 2B 3C
Great job!
4.1.3 Factor summary
As any good bond investor would do, you would like to keep track of how
many bonds you are holding of each credit rating. A way to present a
table of the counts of each bond credit rating would be great! Luckily
for you, the summary()
function for factors can help you with that.
The character vector credit_rating
and the factor credit_factor
are both in your workspace.
Use summary() on credit_rating. Does this seem useful?
# Summarize the character vector, credit_rating
summary(credit_rating)
## Length Class Mode
## 8 character character
Use summary() again, but this time on credit_factor.
# Summarize the factor, credit_factor
summary(credit_factor)
## 2A 3A 1B 2B 3C
## 2 2 1 2 1
Factor summaries are much more useful for tabulating data!
4.1.4 Visualize your factor
You can also visualize the table that you created in the last example by using a bar chart. A bar chart is a type of graph that displays groups of data using rectangular bars where the height of each bar represents the number of counts in that group.
The plot()
function can again take care of all of the magic for you, check it out!
Note that in the example below, you are creating the plot from a factor and not a character vector. R will throw an error if you try and plot a character vector!
credit_factor is in your workspace.
Plot credit_factor to create your first bar chart!
# Visualize your factor!
plot(credit_factor)
Awesome bar chart!
4.1.5 Bucketing a numeric variable into a factor
Your old friend Dan sent you a list of 50 AAA rated bonds called AAA_rank,
with each bond having an additional number from 1-100 describing how
profitable he thinks that bond will be (100 being the most profitable).
You are interested in doing further analysis on his suggestions, but
first it would be nice if the bonds were bucketed by their ranking
somehow. This would help you create groups of bonds, from least
profitable to most profitable, to more easily analyze them.
This is a great example of creating a factor from a numeric vector. The easiest way to do this is to use cut()
. Below, Dan’s 1-100 ranking is bucketed into 5 evenly spaced groups. Note that the (
in the factor levels means we do not include the number beside it in that group, and the ]
means that we do include that number in the group.
head(AAA_rank)
[1] 31 48 100 53 85 73
AAA_factor <- cut(x = AAA_rank, breaks = c(0, 20, 40, 60, 80, 100))
head(AAA_factor)
[1] (20,40] (40,60] (80,100] (40,60] (80,100] (60,80]
Levels: (0,20] (20,40] (40,60] (60,80] (80,100]
In the cut()
function, using breaks =
allows you to specify the groups that you want R to bucket your data by!
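The effect of the ( and ] boundaries is easiest to see with values that sit exactly on a break. The scores below are hypothetical and chosen only to illustrate the boundaries.

```r
# Hypothetical scores, including values that fall exactly on a break
scores <- c(0.5, 20, 20.1, 40, 100)
breaks <- c(0, 20, 40, 60, 80, 100)

buckets <- cut(x = scores, breaks = breaks)
print(data.frame(scores, buckets))

# Count how many scores landed in each bucket
print(summary(buckets))

# A value equal to the lowest break is excluded by the "(" and becomes NA
zero_bucket <- cut(x = 0, breaks = breaks)
print(zero_bucket)
```

A score of 20 lands in (0,20], while 20.1 moves up to (20,40]. Dan's ranking starts at 1, so no bond falls outside the lowest bucket.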
# edited by cliex159
AAA_rank = c(9, 88, 74, 94, 44, 59, 81, 67, 48, 16, 58, 72, 62, 31, 65, 93, 49, 21, 68, 33, 32, 56, 51, 56, 38, 85, 9, 23, 91, 25, 11, 95, 84, 31, 33, 1, 13, 38, 34, 15, 29, 50, 51, 53, 20, 75, 83, 52, 39, 11)
Create 4 buckets for AAA_rank using cut(). For breaks =, use a vector from 0 to 100 where each element is 25 numbers apart. Assign it to AAA_factor.
# Create 4 buckets for AAA_rank using cut()
AAA_factor <- cut(x = AAA_rank, breaks = c(0, 25, 50, 75, 100))
Use levels() to rename the levels to “low”, “medium”, “high”, and “very_high”, in that order.
# Rename the levels
levels(AAA_factor) <- c("low", "medium", "high", "very_high")
Print the newly named AAA_factor.
# Print AAA_factor
AAA_factor
## [1] low very_high high very_high medium high very_high
## [8] high medium low high high high medium
## [15] high very_high medium low high medium medium
## [22] high high high medium very_high low low
## [29] very_high low low very_high very_high medium medium
## [36] low low medium medium low medium medium
## [43] high high low high very_high high medium
## [50] low
## Levels: low medium high very_high
Plot AAA_factor to visualize your work!
# Plot AAA_factor
plot(AAA_factor)
Great! Sometimes factors are easier to plot than numerics.
4.2 Ordering and subsetting factors
4.2.1 Create an ordered factor
Look at the plot created over on the right. It looks great, but look at the order of the bars! No order was specified when you created the factor, so, when R tried to plot it, it just placed the levels in alphabetical order. By now, you know that there is an order to credit ratings, and your plots should reflect that!
As a reminder, the order of credit ratings from least risky to most risky is:
AAA, AA, A, BBB, BB, B, CCC, CC, C, D
To order your factor, there are two options.
When creating a factor, specify ordered = TRUE
and add unique levels in order from least to greatest:
credit_rating <- c("AAA", "AA", "A", "BBB", "AA", "BBB", "A")
credit_factor_ordered <- factor(credit_rating, ordered = TRUE,
levels = c("AAA", "AA", "A", "BBB"))
For an existing unordered factor like credit_factor
, use the ordered()
function:
ordered(credit_factor, levels = c("AAA", "AA", "A", "BBB"))
Both ways result in:
credit_factor_ordered
[1] AAA AA A BBB AA BBB A
Levels: AAA < AA < A < BBB
Notice the <
specifying the order of the levels that was not there before!
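The two approaches can be compared side by side. This sketch uses the example ratings above; from_scratch, from_existing, and rating_levels are illustrative names. The final comparison uses the < operator, which belongs to the relational operators covered in the second course, and is included only to show what the ordering makes possible.

```r
credit_rating <- c("AAA", "AA", "A", "BBB", "AA", "BBB", "A")
rating_levels <- c("AAA", "AA", "A", "BBB")

# Option 1: order the factor as it is created
from_scratch <- factor(credit_rating, ordered = TRUE, levels = rating_levels)

# Option 2: order an existing unordered factor
unordered     <- factor(credit_rating)
from_existing <- ordered(unordered, levels = rating_levels)

print(from_scratch)
print(from_existing)

# With an order defined, elements can be compared against a level
print(from_scratch < "A")
```

Here "less than" follows the level order, so AAA and AA count as less than A even though they are the safer ratings.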
credit_rating is in your workspace.
Use the unique() function with credit_rating to print only the unique words in the character vector. These will be your levels.
# Use unique() to find unique words
unique(credit_rating)
## [1] "BB" "AAA" "AA" "CCC" "B"
Use factor() to create an ordered factor for credit_rating and store it as credit_factor_ordered. Make sure to list the levels from least to greatest in terms of risk!
# Create an ordered factor
credit_factor_ordered <- factor(credit_rating, ordered = TRUE, levels = c("AAA", "AA", "BB", "B", "CCC"))
Plot credit_factor_ordered and note the new order of the bars.
# Plot credit_factor_ordered
plot(credit_factor_ordered)
Awesome! Ordered factors are great for plotting or creating tables with a predefined order.
4.2.2 Subsetting a factor
You can subset factors in much the same way that you subset vectors. As usual, [ ]
is the key! However, R has some interesting behavior when you want to
remove a factor level from your analysis. For example, what if you
wanted to remove the AAA bond from your portfolio?
credit_factor
[1] AAA AA A BBB AA BBB A
Levels: BBB < A < AA < AAA
credit_factor[-1]
[1] AA A BBB AA BBB A
Levels: BBB < A < AA < AAA
R removed the AAA bond at the first position, but left the AAA level
behind! If you were to plot this, you would end up with the bar chart
over to the right. A better plan would have been to tell R to drop the
AAA level entirely. To do that, add drop = TRUE
:
credit_factor[-1, drop = TRUE]
[1] AA A BBB AA BBB A
Levels: BBB < A < AA
That’s what you wanted!
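The leftover level shows up clearly in a summary. The sketch below rebuilds the example factor with the level order shown above; droplevels() is a base R function not covered in this course, included as an alternative for a factor that has already been subsetted.

```r
credit_factor <- factor(c("AAA", "AA", "A", "BBB", "AA", "BBB", "A"),
                        ordered = TRUE, levels = c("BBB", "A", "AA", "AAA"))

# Without drop = TRUE the AAA level survives with a count of zero
kept <- credit_factor[-1]
print(summary(kept))

# With drop = TRUE the unused level is removed
dropped <- credit_factor[-1, drop = TRUE]
print(summary(dropped))

# droplevels() removes unused levels after the fact
cleaned <- droplevels(kept)
print(levels(cleaned))
```

A zero-count level is what produces the empty bar in the plot.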
Remove the “A” bonds from positions 3 and 7 of credit_factor. For now, do not use drop = TRUE. Assign this to keep_level.
# Remove the A bonds at positions 3 and 7. Don't drop the A level.
keep_level <- credit_factor[-c(3,7)]
Plot keep_level.
# Plot keep_level
plot(keep_level)
Remove “A” from credit_factor again, but this time use drop = TRUE. Assign this to drop_level.
# Remove the A bonds at positions 3 and 7. Drop the A level.
drop_level <- credit_factor[-c(3,7), drop = TRUE]
Plot drop_level.
# Plot drop_level
plot(drop_level)
Great! The drop
argument will help you get rid of those pesky factor levels that stick around.
4.2.3 stringsAsFactors
Do you remember back in the data frame chapter when you used str()
on your cash
data frame? This was the output:
str(cash)
'data.frame': 3 obs. of 3 variables:
$ company : Factor w/ 2 levels "A","B": 1 1 2
$ cash_flow: num 100 200 300
$ year : num 1 3 2
See how the company column has been converted to a factor? In versions of R before 4.0.0, the default behavior when creating data frames was to convert all characters into factors. Since R 4.0.0, stringsAsFactors defaults to FALSE.
This has caused countless novice R users a headache trying to figure
out why their character columns are not working properly, but not you!
You will be prepared!
To turn off this behavior:
cash <- data.frame(company, cash_flow, year, stringsAsFactors = FALSE)
str(cash)
'data.frame': 3 obs. of 3 variables:
$ company : chr "A" "A" "B"
$ cash_flow: num 100 200 300
$ year : num 1 3 2
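Setting stringsAsFactors explicitly gives the same result on any version of R. A minimal sketch with the three-row example above (as_factor and as_char are illustrative names):

```r
company   <- c("A", "A", "B")
cash_flow <- c(100, 200, 300)
year      <- c(1, 3, 2)

# Ask for the conversion explicitly
as_factor <- data.frame(company, cash_flow, year, stringsAsFactors = TRUE)
print(class(as_factor$company))

# Keep the characters as characters
as_char <- data.frame(company, cash_flow, year, stringsAsFactors = FALSE)
print(class(as_char$company))
```

Only the character column is affected; the numeric columns are the same in both data frames.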
credit_rating and bond_owners have been defined for you. bond_owners is a character vector of the names of some of your friends.
# Variables
credit_rating <- c("AAA", "A", "BB")
bond_owners <- c("Dan", "Tom", "Joe")
Create a data frame named bonds from credit_rating and bond_owners, in that order, and use stringsAsFactors = FALSE.
# Create the data frame of character vectors, bonds
bonds <- data.frame(credit_rating, bond_owners, stringsAsFactors = FALSE)
Use str() to confirm that both columns are characters.
# Use str() on bonds
str(bonds)
## 'data.frame': 3 obs. of 2 variables:
## $ credit_rating: chr "AAA" "A" "BB"
## $ bond_owners : chr "Dan" "Tom" "Joe"
bond_owners would not be a useful factor, but credit_rating could be! Create a new column in bonds called credit_factor using $ which is created from credit_rating as a correctly ordered factor.
# Create a factor column in bonds called credit_factor from credit_rating
bonds$credit_factor <- factor(bonds$credit_rating, ordered = TRUE, levels = c("AAA","A","BB"))
Use str() again to confirm that credit_factor is an ordered factor.
# Use str() on bonds again
str(bonds)
## 'data.frame': 3 obs. of 3 variables:
## $ credit_rating: chr "AAA" "A" "BB"
## $ bond_owners : chr "Dan" "Tom" "Joe"
## $ credit_factor: Ord.factor w/ 3 levels "AAA"<"A"<"BB": 1 2 3
5 Lists
Wouldn’t it be nice if there was a way to hold related vectors, matrices, or data frames together in R? In this final chapter, you will explore lists and many of their interesting features by building a small portfolio of stocks.
5.1 What is a list?
5.1.1 Create a list
Just like a grocery list, lists in R can be used to hold together items
of different data types. Creating a list is, you guessed it, as simple
as using the list()
function. You could say that a list is a kind of super data type: you
can store practically any piece of information in it! Create a list like
so:
words <- c("I <3 R")
numbers <- c(42, 24)
my_list <- list(words, numbers)
my_list
[[1]]
[1] "I <3 R"
[[2]]
[1] 42 24
Below, you will create your first list from some of the data you have already worked with!
# List components
name <- "Apple and IBM"
apple <- c(109.49, 109.90, 109.11, 109.95, 111.03)
ibm <- c(159.82, 160.02, 159.84, 160.35, 164.79)
cor_matrix <- cor(cbind(apple, ibm))
Use list() to create a list of name, apple, ibm, and cor_matrix, in that order, and assign it to portfolio.
# Create a list
portfolio <- list(name, apple, ibm, cor_matrix)
Print portfolio.
# View your first list
portfolio
## [[1]]
## [1] "Apple and IBM"
##
## [[2]]
## [1] 109.49 109.90 109.11 109.95 111.03
##
## [[3]]
## [1] 159.82 160.02 159.84 160.35 164.79
##
## [[4]]
## apple ibm
## apple 1.0000000 0.9131575
## ibm 0.9131575 1.0000000
Awesome! Lists are great for holding groups of related data structures together.
5.1.2 Named lists
Knowing how forgetful you are, you decide it would be important to add names to your list so you can remember what each element is describing. There are two ways to do this!
You could name the elements as you create the list with the form name = value
:
my_list <- list(my_words = words, my_numbers = numbers)
Or, if the list was already created, you could use names()
:
my_list <- list(words, numbers)
names(my_list) <- c("my_words", "my_numbers")
Both would result in:
my_list
$my_words
[1] "I <3 R"
$my_numbers
[1] 42 24
The portfolio list is available to work with.
Use names() to add the following names to your list: “portfolio_name”, “apple”, “ibm”, “correlation”, in that order.
# Add names to your portfolio
names(portfolio) <- c("portfolio_name", "apple", "ibm", "correlation")
Print portfolio to see your newly named list.
# Print the named portfolio
portfolio
## $portfolio_name
## [1] "Apple and IBM"
##
## $apple
## [1] 109.49 109.90 109.11 109.95 111.03
##
## $ibm
## [1] 159.82 160.02 159.84 160.35 164.79
##
## $correlation
## apple ibm
## apple 1.0000000 0.9131575
## ibm 0.9131575 1.0000000
Adding names to your list will make them much easier to understand!
5.1.3 Access elements in a list
Subsetting a list is similar to subsetting a vector or data frame, with one extra useful operation.
To access the elements in the list, use [ ]
. This will always return another list.
my_list[1]
$my_words
[1] "I <3 R"
my_list[c(1,2)]
$my_words
[1] "I <3 R"
$my_numbers
[1] 42 24
To pull out the data inside each element of your list, use [[ ]]
.
my_list[[1]]
[1] "I <3 R"
If your list is named, you can use the $ operator: my_list$my_words. This is the same as using [[ ]] to return the inner data.
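The difference between [ ] and [[ ]] matters as soon as you try to compute with the result. A small sketch using the my_list example (single and inner are illustrative names):

```r
words   <- c("I <3 R")
numbers <- c(42, 24)
my_list <- list(my_words = words, my_numbers = numbers)

# [ ] returns a smaller list that still wraps the data
single <- my_list[2]
print(class(single))

# [[ ]] and $ return the data itself
inner <- my_list[[2]]
print(class(inner))
print(identical(inner, my_list$my_numbers))

# Arithmetic works on the inner data, not on the wrapping list
total <- sum(my_list[[2]])
print(total)
```

Calling sum() on my_list[2] would fail, because the numbers are still wrapped in a list.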
The portfolio named list is available for use.
Access the second and third elements of portfolio using [ ] and c().
# Second and third elements of portfolio
portfolio[c(2,3)]
## $apple
## [1] 109.49 109.90 109.11 109.95 111.03
##
## $ibm
## [1] 159.82 160.02 159.84 160.35 164.79
Use $ to access the correlation data.
# Use $ to get the correlation data
portfolio$correlation
## apple ibm
## apple 1.0000000 0.9131575
## ibm 0.9131575 1.0000000
Notice how the use of $
in lists is similar to data frames!
5.1.4 Adding to a list
Once you create a list, you aren’t stuck with it forever. You can
add new elements to it whenever you want! Say you want to add your
friend Dan’s favorite movie to your list. You can do so using $ like you did when adding new columns to data frames.
my_list$dans_movie <- "StaR Wars"
my_list
$my_words
[1] "I <3 R"
$my_numbers
[1] 42 24
$dans_movie
[1] "StaR Wars"
You could have also used c() to add another element to the list: c(my_list, dans_movie = "StaR Wars"). This can be useful if you want to add multiple elements to your list at once.
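Both ways of adding elements are sketched below. The toms_movie, toms_number, and extra elements are hypothetical and exist only for illustration.

```r
my_list <- list(my_words = "I <3 R", my_numbers = c(42, 24))

# Add one element with $
my_list$dans_movie <- "StaR Wars"
print(names(my_list))

# Add several elements at once with c()
bigger <- c(my_list, toms_movie = "Heat", toms_number = 7)
print(names(bigger))

# c() spreads a longer vector into separate elements...
spread <- c(my_list, extra = c(1, 2))
print(length(spread))

# ...so wrap it in list() to add it as a single element
wrapped <- c(my_list, list(extra = c(1, 2)))
print(length(wrapped))
```

Note that c() returns a new list, so the result has to be assigned to keep it.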
A useful addition to portfolio is the variable weight describing how invested you are in Apple and IBM. Fill in the ___ correctly so that you are invested 20% in Apple and 80% in IBM. Remember to use decimal numbers, not percentages!
# Add weight: 20% Apple, 80% IBM
portfolio$weight <- c(apple = .2, ibm = .8)
Print portfolio to see the weight element.
# Print portfolio
portfolio
## $portfolio_name
## [1] "Apple and IBM"
##
## $apple
## [1] 109.49 109.90 109.11 109.95 111.03
##
## $ibm
## [1] 159.82 160.02 159.84 160.35 164.79
##
## $correlation
## apple ibm
## apple 1.0000000 0.9131575
## ibm 0.9131575 1.0000000
##
## $weight
## apple ibm
## 0.2 0.8
You can overwrite an existing element using $. Change weight to be invested 30% in Apple and 70% in IBM.
# Change the weight variable: 30% Apple, 70% IBM
portfolio$weight <- c(apple = .3, ibm = .7)
Print portfolio again to see your changes.
# Print portfolio to see the changes
portfolio
## $portfolio_name
## [1] "Apple and IBM"
##
## $apple
## [1] 109.49 109.90 109.11 109.95 111.03
##
## $ibm
## [1] 159.82 160.02 159.84 160.35 164.79
##
## $correlation
## apple ibm
## apple 1.0000000 0.9131575
## ibm 0.9131575 1.0000000
##
## $weight
## apple ibm
## 0.3 0.7
Great job!
5.1.5 Removing from a list
The natural next step is to learn how to remove elements from a list.
You decide that even though Dan is your best friend, you don’t want his
info in your list. To remove dans_movie
:
my_list$dans_movie <- NULL
my_list
$my_words
[1] "I <3 R"
$my_numbers
[1] 42 24
Using NULL
is the easiest way to remove an element from
your list! If your list is not named, you can also remove elements by
position using my_list[1] <- NULL
or my_list[[1]] <- NULL
.
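Removal by name and by position can be compared on copies of the same list. A minimal sketch, with by_name and by_position as illustrative names:

```r
my_list <- list(my_words = "I <3 R", my_numbers = c(42, 24), dans_movie = "StaR Wars")

# Remove by name
by_name <- my_list
by_name$dans_movie <- NULL
print(names(by_name))

# Remove by position (dans_movie is the third element)
by_position <- my_list
by_position[[3]] <- NULL
print(names(by_position))

# The list shrinks; the remaining elements keep their names and order
print(length(by_name))
```

Removing by name is usually safer, since positions shift after every removal.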
Take a look at portfolio. It seems that someone has added microsoft stock that you did not buy!
# Take a look at portfolio
portfolio
## $portfolio_name
## [1] "Apple and IBM"
##
## $apple
## [1] 109.49 109.90 109.11 109.95 111.03
##
## $ibm
## [1] 159.82 160.02 159.84 160.35 164.79
##
## $correlation
## apple ibm
## apple 1.0000000 0.9131575
## ibm 0.9131575 1.0000000
##
## $weight
## apple ibm
## 0.3 0.7
Remove the microsoft element of portfolio using NULL.
# Remove the microsoft stock prices from your portfolio
portfolio$microsoft <- NULL
Awesome! Now the list only has your information again. Sorry, Dan!
5.2 A few list creating functions
5.2.1 Split it
Often, you will have data for multiple groups together in one data frame. The cash data frame was an example of this back in Chapter 3. There were cash_flow and year columns for two groups (companies A and B). What if you wanted to split up this data frame into two separate data frames divided by company? In the next exercise, you will explore why you might want to do this, but first let’s explore how to make this happen using the split() function.
Create a grouping to split on, and use split() to create a list of two data frames.
grouping <- cash$company
split_cash <- split(cash, grouping)
split_cash
$A
company cash_flow year
1 A 1000 1
2 A 4000 3
3 A 550 4
$B
company cash_flow year
4 B 1500 1
5 B 1100 2
6 B 750 4
7 B 6000 5
To get your original data frame back, use unsplit(split_cash, grouping)
.
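The round trip through split() and unsplit() can be verified on a stand-in for cash. The data frame below is rebuilt from the values printed in this chapter (an assumption, since the real cash is created earlier in the course), and restored is an illustrative name.

```r
# Stand-in for the cash data frame, rebuilt from the values printed in this chapter
cash <- data.frame(
  company   = c("A", "A", "A", "B", "B", "B", "B"),
  cash_flow = c(1000, 4000, 550, 1500, 1100, 750, 6000),
  year      = c(1, 3, 4, 1, 2, 4, 5),
  stringsAsFactors = FALSE
)

grouping   <- cash$company
split_cash <- split(cash, grouping)

# One data frame per group, named after the group
print(names(split_cash))
print(nrow(split_cash$A))
print(nrow(split_cash$B))

# unsplit() needs the same grouping to put the rows back in their original order
restored <- unsplit(split_cash, grouping)
print(restored)
```

The grouping vector is what tells unsplit() where each row came from, so keep it around after splitting.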
The cash data frame is available in your workspace.
Create a grouping from the year column.
# Define grouping from year
grouping <- cash$year
Use split() to split cash into a list of 5 data frames separated by year. Assign this to split_cash.
# Split cash on your new grouping
split_cash <- split(cash, grouping)
Print split_cash.
# Look at your split_cash list
split_cash
## $`1`
## cash_flow year quarter_cash double_year present_value
## 1 1000 1 250 2 952.381
## 4 1500 1 375 2 1428.571
##
## $`2`
## cash_flow year quarter_cash double_year present_value
## 5 1100 2 275 4 997.7324
##
## $`3`
## cash_flow year quarter_cash double_year present_value
## 2 4000 3 1000 6 3455.35
##
## $`4`
## cash_flow year quarter_cash double_year present_value
## 3 550 4 137.5 8 452.4864
## 6 750 4 187.5 8 617.0269
##
## $`5`
## cash_flow year quarter_cash double_year present_value
## 7 6000 5 1500 10 4701.157
Use unsplit() to combine the data frames again. Assign this to original_cash.
# Unsplit split_cash to get the original data back.
original_cash <- unsplit(split_cash, grouping)
Print original_cash to compare to the first cash data frame.
# Print original_cash
original_cash
## cash_flow year quarter_cash double_year present_value
## 1 1000 1 250.0 2 952.3810
## 2 4000 3 1000.0 6 3455.3504
## 3 550 4 137.5 8 452.4864
## 4 1500 1 375.0 2 1428.5714
## 5 1100 2 275.0 4 997.7324
## 6 750 4 187.5 8 617.0269
## 7 6000 5 1500.0 10 4701.1570
Great job! This is a very important concept for more advanced data wrangling.
5.2.2 Split-Apply-Combine
A common data science problem is to split your data frame by a grouping, apply some transformation to each group, and then recombine those pieces back into one data frame. This is such a common class of problems in R that it has been given the name split-apply-combine. In Intermediate R for Finance, you will explore a number of these problems and functions that are useful when solving them, but, for now, let’s do a simple example.
Suppose, for the cash
data frame, you are interested in doubling the cash_flow
for company A, and tripling it for company B:
grouping <- cash$company
split_cash <- split(cash, grouping)
# We can access each list element's cash_flow column by:
split_cash$A$cash_flow
[1] 1000 4000 550
split_cash$A$cash_flow <- split_cash$A$cash_flow * 2
split_cash$B$cash_flow <- split_cash$B$cash_flow * 3
new_cash <- unsplit(split_cash, grouping)
Take a look again at how you access the cash_flow column. The first $ is to access the A element of the split_cash list. The second $ is to access the cash_flow column of the data frame in A.
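The full split-apply-combine pattern can be run end to end on a stand-in for cash, rebuilt here from the values printed in this chapter (an assumption, since the real cash is created earlier in the course).

```r
# Stand-in for the cash data frame, rebuilt from the values printed in this chapter
cash <- data.frame(
  company   = c("A", "A", "A", "B", "B", "B", "B"),
  cash_flow = c(1000, 4000, 550, 1500, 1100, 750, 6000),
  year      = c(1, 3, 4, 1, 2, 4, 5),
  stringsAsFactors = FALSE
)

# Split: one data frame per company
grouping   <- cash$company
split_cash <- split(cash, grouping)

# Apply: a different transformation for each group
split_cash$A$cash_flow <- split_cash$A$cash_flow * 2
split_cash$B$cash_flow <- split_cash$B$cash_flow * 3

# Combine: back into a single data frame in the original row order
new_cash <- unsplit(split_cash, grouping)
print(new_cash)
```

Only the cash_flow column changes; company and year come back exactly as they went in.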
The split_cash list is available for you. Also, the grouping that was used to split cash is available.
Print split_cash to get a look at the list.
# Print split_cash
split_cash
## $`1`
## cash_flow year quarter_cash double_year present_value
## 1 1000 1 250 2 952.381
## 4 1500 1 375 2 1428.571
##
## $`2`
## cash_flow year quarter_cash double_year present_value
## 5 1100 2 275 4 997.7324
##
## $`3`
## cash_flow year quarter_cash double_year present_value
## 2 4000 3 1000 6 3455.35
##
## $`4`
## cash_flow year quarter_cash double_year present_value
## 3 550 4 137.5 8 452.4864
## 6 750 4 187.5 8 617.0269
##
## $`5`
## cash_flow year quarter_cash double_year present_value
## 7 6000 5 1500 10 4701.157
Print the cash_flow column for company B in split_cash.
# Print the cash_flow column of B in split_cash
split_cash$B$cash_flow
## NULL
Set the cash_flow for company A to 0.
# Set the cash_flow column of company A in split_cash to 0
split_cash$A$cash_flow <- 0
Use the grouping to unsplit() the split_cash list. Assign this to cash_no_A.
# Use the grouping to unsplit split_cash
cash_no_A <- unsplit(split_cash, grouping)
Print cash_no_A to see the modified data frame.
# Print cash_no_A
cash_no_A
## cash_flow year quarter_cash double_year present_value
## 1 1000 1 250.0 2 952.3810
## 2 4000 3 1000.0 6 3455.3504
## 3 550 4 137.5 8 452.4864
## 4 1500 1 375.0 2 1428.5714
## 5 1100 2 275.0 4 997.7324
## 6 750 4 187.5 8 617.0269
## 7 6000 5 1500.0 10 4701.1570
Great job! You will learn much more about this, and the apply()
functions in the second course.
5.2.3 Attributes
You have made it to the last exercise in the course! Congrats! Let’s finish up with an easy one.
Attributes are a bit of extra metadata about your data structure. Some
of the most common attributes are: row names and column names,
dimensions, and class. You can use the attributes()
function to return a list of attributes about the object you pass in. To access a specific attribute, you can use the attr()
function.
Exploring the attributes of cash
:
attributes(cash)
$names
[1] "company" "cash_flow" "year"
$row.names
[1] 1 2 3 4 5 6 7
$class
[1] "data.frame"
attr(cash, which = "names")
[1] "company" "cash_flow" "year"
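The same calls can be tried on a stand-in for cash, rebuilt from the values printed in this chapter (an assumption, since the real cash is created earlier in the course). The variable attrs is an illustrative name.

```r
# Stand-in for the cash data frame, rebuilt from the values printed in this chapter
cash <- data.frame(
  company   = c("A", "A", "A", "B", "B", "B", "B"),
  cash_flow = c(1000, 4000, 550, 1500, 1100, 750, 6000),
  year      = c(1, 3, 4, 1, 2, 4, 5),
  stringsAsFactors = FALSE
)

# attributes() returns a named list, so it can be inspected like any other list
attrs <- attributes(cash)
print(names(attrs))

# The "names" attribute of a data frame holds its column names
print(identical(attr(cash, which = "names"), colnames(cash)))

# Asking for an attribute that is not stored returns NULL
print(attr(cash, which = "dim"))
```

A data frame does not store a dim attribute the way a matrix does, which is why the last call returns NULL.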
The matrix my_matrix and the factor my_factor are defined for you.
# my_matrix and my_factor
my_matrix <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)
rownames(my_matrix) <- c("Row1", "Row2")
colnames(my_matrix) <- c("Col1", "Col2", "Col3")
my_factor <- factor(c("A", "A", "B"), ordered = TRUE, levels = c("A", "B"))
Use attributes() on my_matrix.
# attributes of my_matrix
attributes(my_matrix)
## $dim
## [1] 2 3
##
## $dimnames
## $dimnames[[1]]
## [1] "Row1" "Row2"
##
## $dimnames[[2]]
## [1] "Col1" "Col2" "Col3"
Use attr() on my_matrix to return the “dim” attribute.
# Just the dim attribute of my_matrix
attr(my_matrix, which = "dim")
## [1] 2 3
Use attributes() on my_factor.
# attributes of my_factor
attributes(my_factor)
## $levels
## [1] "A" "B"
##
## $class
## [1] "ordered" "factor"
5.3 Congratulations!
5.3.1 Congratulations!
From quantitative finance, to machine learning, to geospatial data, the possibilities of what you can do with R are just about endless. My hope is that you take what you learned in this course, and use that knowledge to explore a data set that interests you.
5.3.2 More to learn
5.3.3 Keep learning!
We have only just scratched the surface of what R can do, and I hope you will check out some of our other courses to learn much more about it. If you are interested in continuing the Finance curriculum, check out Intermediate R for Finance. I’m happy that you were able to take this course with me, thanks for attending!