Modeling with Data in the Tidyverse

Albert Y. Kim - DataCamp


Course Description

In this course, you will learn to model with data. Models attempt to capture the relationship between an outcome variable of interest and a series of explanatory/predictor variables. Such models can be used for both explanatory purposes, e.g. “Does knowing professors’ ages help explain their teaching evaluation scores?”, and predictive purposes, e.g., “How well can we predict a house’s price based on its size and condition?” You will leverage your tidyverse skills to construct and interpret such models. This course centers around the use of linear regression, one of the most commonly-used and easy to understand approaches to modeling. Such modeling and thinking is used in a wide variety of fields, including statistics, causal inference, machine learning, and artificial intelligence.

1 Introduction to Modeling

This chapter will introduce you to some background theory and terminology for modeling, in particular, the general modeling framework, the difference between modeling for explanation and modeling for prediction, and the modeling problem. Furthermore, you’ll start performing your first exploratory data analysis, a crucial first step before any formal modeling.

1.1 Background on modeling for explanation

1.1.1 Exploratory visualization of age

Let’s perform an exploratory data analysis (EDA) of the numerical explanatory variable age. You should always perform an exploratory analysis of your variables before any formal modeling. This will give you a sense of your variable’s distributions, any outliers, and any patterns that might be useful when constructing your eventual model.

  • Using the ggplot2 package, create a histogram of age with bins in 5 year increments.
  • Label the x axis with “age” and the y axis with “count”.
# Load packages
library(moderndive)
library(ggplot2)

# Plot the histogram
ggplot(evals, aes(x = age)) +
  geom_histogram(binwidth = 5) +
  labs(x = "age", y = "count")

Nice! The 463 instructors’ ages are centered roughly at age 50. You’ll now compute notions of center numerically!

1.1.2 Numerical summaries of age

Let’s continue our exploratory data analysis of the numerical explanatory variable age by computing summary statistics. Summary statistics take many values and summarize them with a single value. Let’s compute three such values using dplyr data wrangling: mean (AKA the average), the median (the middle value), and the standard deviation (a measure of spread/variation).

Calculate the mean, median, and standard deviation of age.

# Load packages
library(moderndive)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Compute summary stats
evals %>%
  summarize(mean_age = mean(age),
            median_age = median(age),
            sd_age = sd(age))
## # A tibble: 1 × 3
##   mean_age median_age sd_age
##      <dbl>      <int>  <dbl>
## 1     48.4         48   9.80

Great! As suggested in the previous histogram for age, the center of the distribution as quantified by the mean and median is around 48 years old!

1.2 Background on modeling for prediction

1.2.1 Exploratory visualization of house size

Let’s create an exploratory visualization of the predictor variable reflecting the size of houses: sqft_living the square footage of living space where 1 sq.foot ≈ 0.1 sq.meter.

  • Create a histogram of sqft_living.
  • Label the x axis with “Size (sq.feet)” and the y axis with “count”.
# Load packages
library(moderndive)
library(ggplot2)

# Plot the histogram
ggplot(house_prices, aes(x = sqft_living)) +
  geom_histogram() +
  labs(x = "Size (sq.feet)", y = "count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

After plotting the histogram, what can you say about the distribution of the variable sqft_living?

  • There is no skew.
  • sqft_living is right-skewed.
  • sqft_living is left-skewed.

Correct! A variable is right-skewed if it has a long tail to the right.
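
As a quick numeric check of this (not part of the original exercise), recall that a right-skewed variable tends to have a mean larger than its median; a sketch using dplyr, assuming house_prices from moderndive is loaded:

# Load packages
library(moderndive)
library(dplyr)

# For a right-skewed variable, the mean is pulled above the median
house_prices %>%
  summarize(mean_size = mean(sqft_living),
            median_size = median(sqft_living))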

1.2.2 Log10 transformation of house size

You just saw that the predictor variable sqft_living is right-skewed and hence a log base 10 transformation is warranted to unskew it. Just as we transformed the outcome variable price to create log10_price in the video, let’s do the same for sqft_living.

  • Using the mutate() function from dplyr, create a new column log10_size and assign it to house_prices_2 by applying a log10() transformation to sqft_living.
# Load packages
library(moderndive)
library(dplyr)
library(ggplot2)

# Add log10_size
house_prices_2 <- house_prices %>%
  mutate(log10_size = log10(sqft_living))

Visualize the effect of the log10() transformation by creating a histogram of the new variable log10_size.

# Plot the histogram  
ggplot(house_prices_2, aes(x = log10_size)) +
  geom_histogram() +
  labs(x = "log10 size", y = "count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Huzzah! Notice how the distribution is much less skewed. Going forward, you’ll use this new transformed variable to represent the size of houses.

1.3 The modeling problem for explanation

1.3.1 EDA of relationship of teaching & "beauty" scores

Researchers at UT Austin created a “beauty score” by asking a panel of six students to rate the “beauty” of all 463 instructors. They were interested in studying any possible impact of “beauty” on teaching evaluation scores. Let’s do an EDA of this variable and its relationship with teaching score.

From now on, assume that ggplot2, dplyr, and moderndive are all available in your workspace unless you’re told otherwise.

Create a histogram of bty_avg “beauty scores” with bins of size 0.5.

# Plot the histogram
ggplot(evals, aes(x = bty_avg)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Beauty score", y = "count")

Create a scatterplot with the outcome variable score on the y-axis and the explanatory variable bty_avg on the x-axis.

# Scatterplot
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point() +
  labs(x = "beauty score", y = "teaching score")

Let’s now investigate if this plot suffers from overplotting, whereby points are stacked perfectly on top of each other, obscuring the number of points involved. You can do this by jittering the points. Update the code accordingly!

# Jitter plot
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_jitter() +
  labs(x = "beauty score", y = "teaching score")

“Beauty”-ful! It seems the original scatterplot did suffer from overplotting since the jittered scatterplot reveals many originally hidden points. Most bty_avg scores range from 2-8, with 5 being about the center.
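
Jittering is one fix for overplotting; another common option (not used in this exercise) is to make the points semi-transparent via the alpha argument. A sketch:

# Semi-transparent points: darker regions indicate more overlapping points
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point(alpha = 0.2) +
  labs(x = "beauty score", y = "teaching score")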

1.3.2 Correlation between teaching and "beauty" scores

Let’s numerically summarize the relationship between teaching score and beauty score bty_avg using the correlation coefficient.

Compute the correlation coefficient of score and bty_avg.

# Compute correlation
evals %>%
  summarize(correlation = cor(score, bty_avg))
## # A tibble: 1 × 1
##   correlation
##         <dbl>
## 1       0.187

Based on this, what can you say about the relationship between these two variables?

  • score and bty_avg are strongly negatively associated.
  • score and bty_avg are weakly negatively associated.
  • score and bty_avg are weakly positively associated.
  • score and bty_avg are strongly positively associated.

Correct! While there seems to be a positive relationship, +0.187 is still a long ways from +1, so the correlation is only weakly positive.
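
If you prefer a formula-style interface, the moderndive package also provides get_correlation(), which should return the same value; a sketch, assuming a recent version of moderndive:

# Same correlation via moderndive's formula interface
evals %>%
  get_correlation(formula = score ~ bty_avg)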

1.4 The modeling problem for prediction

1.4.1 EDA of relationship of house price and waterfront

Let’s now perform an exploratory data analysis of the relationship between log10_price, the log base 10 house price, and the binary variable waterfront. Let’s look at the raw values of waterfront and then visualize their relationship.

The column log10_price has been added for you in the house_prices dataset.

  • Use glimpse() to view the structure of only two columns: log10_price and waterfront.
# Setup: add log10-transformed price and size to house_prices
house_prices <- house_prices %>%
  mutate(log10_price = log10(price),
         log10_size = log10(sqft_living))

# View the structure of log10_price and waterfront
house_prices %>%
  select(log10_price, waterfront) %>%
  glimpse()
## Rows: 21,613
## Columns: 2
## $ log10_price <dbl> 5.346157, 5.730782, 5.255273, 5.781037, 5.707570, 6.088136…
## $ waterfront  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…

Visualize the relationship between waterfront and log10_price using an appropriate geom_* function. Remember that waterfront is categorical.

# Plot
ggplot(house_prices, aes(x = waterfront, y = log10_price)) +
  geom_boxplot() +
  labs(x = "waterfront", y = "log10 price")

A+! Look at that boxplot! Houses that have a view of the waterfront tend to be MUCH more expensive as evidenced by the much higher log10 prices!

1.4.2 Predicting house price with waterfront

You just saw that houses with a view of the waterfront tend to be much more expensive. But by how much? Let’s compute group means of log10_price, convert them back to dollar units, and compare!

The variable log10_price has already been added to house_prices for you.

Return both the mean of log10_price and the count of houses in each level of waterfront.

# Calculate stats
house_prices %>%
  group_by(waterfront) %>%
  summarize(mean_log10_price = mean(log10_price), n = n())
## # A tibble: 2 × 3
##   waterfront mean_log10_price     n
##   <lgl>                 <dbl> <int>
## 1 FALSE                  5.66 21450
## 2 TRUE                   6.12   163

Using these group means for log10_price, return “good” predicted house prices in the original units of US dollars.

# Prediction of price for houses without view of waterfront
10^(5.66)
## [1] 457088.2
# Prediction of price for houses with view of waterfront
10^(6.12)
## [1] 1318257

100%! Most houses don’t have a view of the waterfront (n = 21,450), but those that do (n = 163) have a MUCH higher predicted price. Look at that difference! $457,088 versus $1,318,257! In the upcoming Chapter 2 on basic regression, we’ll build on such intuition and construct our first formal explanatory and predictive models using basic regression!
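
Rather than copying the group means by hand, you could also pipe them directly into the back-transformation. A sketch, assuming log10_price has been added to house_prices as above:

# Group means in log10 dollars, back-transformed to dollars
house_prices %>%
  group_by(waterfront) %>%
  summarize(mean_log10_price = mean(log10_price), n = n()) %>%
  mutate(mean_price_dollars = 10^mean_log10_price)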

2 Modeling with Basic Regression

Equipped with your understanding of the general modeling framework, in this chapter we’ll cover basic linear regression, where you’ll keep things simple and model the outcome variable y as a function of a single explanatory/predictor variable x. We’ll use both numerical and categorical x variables. The outcome variable of interest in this chapter will be teaching evaluation scores of instructors at the University of Texas, Austin.

2.1 Explaining teaching score with age

2.1.1 Plotting a "best-fitting" regression line

Previously you visualized the relationship of teaching score and “beauty score” via a scatterplot. Now let’s add the “best-fitting” regression line to provide a sense of any overall trends. Even though you know this plot suffers from overplotting, you’ll stick to the non-jitter version.

Add a regression line without the error bars to the scatterplot.

# Load packages
library(ggplot2)
library(dplyr)
library(moderndive)

# Plot
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point() +
  labs(x = "beauty score", y = "score") +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Fantastic! The overall trend seems to be positive! As instructors have higher “beauty” scores, so also do they tend to have higher teaching scores.

2.1.2 Fitting a regression with a numerical x

Let’s now explicitly quantify the linear relationship between score and bty_avg using linear regression. You will do this by first “fitting” the model. Then you will get the regression table, a standard output in many statistical software packages.

Fit a linear regression model between score and average beauty using the lm() function and save this model to model_score_2.

# Load package
library(moderndive)

# Fit model
model_score_2 <- lm(score ~ bty_avg, data = evals)

# Output content
model_score_2
## 
## Call:
## lm(formula = score ~ bty_avg, data = evals)
## 
## Coefficients:
## (Intercept)      bty_avg  
##     3.88034      0.06664

Given the sparsity of the output, let’s get the regression table using the get_regression_table() wrapper function.

# Output regression table
get_regression_table(model_score_2)
## # A tibble: 2 × 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    3.88      0.076     51.0        0    3.73     4.03 
## 2 bty_avg      0.067     0.016      4.09       0    0.035    0.099

Based on the output of get_regression_table(), which interpretation of the slope coefficient is correct?

  • For every person who has a beauty score of one, their teaching score will be 0.0670.
  • For every increase of one in beauty score, you should observe an associated increase of on average 0.0670 units in teaching score.
  • Less “beautiful” instructors tend to get higher teaching evaluation scores.

Correct! As suggested by your exploratory visualization, there is a positive relationship between “beauty” and teaching score.
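
As a cross-check (not required by the exercise), base R's summary() reports the same coefficients that get_regression_table() tidies and rounds:

# Unrounded coefficients, standard errors, t-statistics, and p-values
summary(model_score_2)$coefficients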

2.2 Predicting teaching score using age

2.2.1 Making predictions using "beauty score"

Say there is an instructor at UT Austin and you know nothing about them except that their beauty score is 5. What is your prediction ŷ of their teaching score y?

get_regression_table(model_score_2)
  term      estimate std_error statistic p_value lower_ci upper_ci
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept    3.88      0.076     51.0        0    3.73     4.03 
2 bty_avg      0.067     0.016      4.09       0    0.035    0.099
  • Using the values of the intercept and slope from above, predict this instructor’s score.
# Use fitted intercept and slope to get a prediction
y_hat <- 3.88 + 5 * 0.0670
y_hat
## [1] 4.215

Say it’s revealed that the instructor’s score is 4.7. Compute the residual for this prediction, i.e., the residual y − ŷ.

# Compute residual y - y_hat
4.7 - 4.215
## [1] 0.485

Awesome! Was your visual guess close to the predicted teaching score of 4.215? Also, note that this prediction is off by about 0.485 units in teaching score.
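
The same prediction can be made programmatically, without copying rounded coefficients, using base R's predict(); a sketch:

# Predict the teaching score of an instructor with a beauty score of 5
new_instructor <- data.frame(bty_avg = 5)
predict(model_score_2, newdata = new_instructor)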

2.2.2 Computing fitted/predicted values & residuals

Now say you want to repeat this for all 463 instructors in evals. Doing this manually as you just did would be long and tedious, so as seen in the video, let’s automate this using the get_regression_points() function.

Furthermore, let’s unpack its output.

  • Let’s once again get the regression table for model_score_2.
  • # Fit regression model
    model_score_2 <- lm(score ~ bty_avg, data = evals)
    
    # Get regression table
    get_regression_table(model_score_2)
    ## # A tibble: 2 × 7
    ##   term      estimate std_error statistic p_value lower_ci upper_ci
    ##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
    ## 1 intercept    3.88      0.076     51.0        0    3.73     4.03 
    ## 2 bty_avg      0.067     0.016      4.09       0    0.035    0.099
  • Apply get_regression_points() from the moderndive package to automate making predictions and computing residuals.
  • # Get all fitted/predicted values and residuals
    get_regression_points(model_score_2)
    ## # A tibble: 463 × 5
    ##       ID score bty_avg score_hat residual
    ##    <int> <dbl>   <dbl>     <dbl>    <dbl>
    ##  1     1   4.7    5         4.21    0.486
    ##  2     2   4.1    5         4.21   -0.114
    ##  3     3   3.9    5         4.21   -0.314
    ##  4     4   4.8    5         4.21    0.586
    ##  5     5   4.6    3         4.08    0.52 
    ##  6     6   4.3    3         4.08    0.22 
    ##  7     7   2.8    3         4.08   -1.28 
    ##  8     8   4.1    3.33      4.10   -0.002
    ##  9     9   3.4    3.33      4.10   -0.702
    ## 10    10   4.5    3.17      4.09    0.409
    ## # … with 453 more rows
    • Let’s unpack the contents of the score_hat column. First, run the code that fits the model and outputs the regression table.
    • Add a new column score_hat_2 which replicates how score_hat is computed using the table’s values.
    # Get all fitted/predicted values and residuals
    get_regression_points(model_score_2) %>% 
      mutate(score_hat_2 = 3.88 + 0.067 * bty_avg)
    ## # A tibble: 463 × 6
    ##       ID score bty_avg score_hat residual score_hat_2
    ##    <int> <dbl>   <dbl>     <dbl>    <dbl>       <dbl>
    ##  1     1   4.7    5         4.21    0.486        4.22
    ##  2     2   4.1    5         4.21   -0.114        4.22
    ##  3     3   3.9    5         4.21   -0.314        4.22
    ##  4     4   4.8    5         4.21    0.586        4.22
    ##  5     5   4.6    3         4.08    0.52         4.08
    ##  6     6   4.3    3         4.08    0.22         4.08
    ##  7     7   2.8    3         4.08   -1.28         4.08
    ##  8     8   4.1    3.33      4.10   -0.002        4.10
    ##  9     9   3.4    3.33      4.10   -0.702        4.10
    ## 10    10   4.5    3.17      4.09    0.409        4.09
    ## # … with 453 more rows
    • Now let’s unpack the contents of the residual column. First, run the code that fits the model and outputs the regression table.
    • Add a new column residual_2 which replicates how residual is computed using the table’s values.
    # Get all fitted/predicted values and residuals
    get_regression_points(model_score_2) %>% 
      mutate(residual_2 = score - score_hat)
    ## # A tibble: 463 × 6
    ##       ID score bty_avg score_hat residual residual_2
    ##    <int> <dbl>   <dbl>     <dbl>    <dbl>      <dbl>
    ##  1     1   4.7    5         4.21    0.486    0.486  
    ##  2     2   4.1    5         4.21   -0.114   -0.114  
    ##  3     3   3.9    5         4.21   -0.314   -0.314  
    ##  4     4   4.8    5         4.21    0.586    0.586  
    ##  5     5   4.6    3         4.08    0.52     0.520  
    ##  6     6   4.3    3         4.08    0.22     0.220  
    ##  7     7   2.8    3         4.08   -1.28    -1.28   
    ##  8     8   4.1    3.33      4.10   -0.002   -0.00200
    ##  9     9   3.4    3.33      4.10   -0.702   -0.702  
    ## 10    10   4.5    3.17      4.09    0.409    0.409  
    ## # … with 453 more rows

    Bingo! You’ll see later that the residuals can provide useful information about the quality of your regression models. Stay tuned!

    2.3 Explaining teaching score with gender

    2.3.1 EDA of relationship of score and rank

    Let’s perform an EDA of the relationship between an instructor’s score and their rank in the evals dataset. You’ll both visualize this relationship and compute summary statistics for each level of rank: teaching, tenure track, and tenured.

    Write the code to create a boxplot of the relationship between teaching score and rank.

    ggplot(evals, aes(x = rank, y = score)) +
      geom_boxplot() +
      labs(x = "rank", y = "score")

    • For each unique value in rank:
      • Count the number of observations in each group
      • Find the mean and standard deviation of score
    evals %>%
      group_by(rank) %>%
      summarize(n = n(), mean_score = mean(score), sd_score = sd(score))
    ## # A tibble: 3 × 4
    ##   rank             n mean_score sd_score
    ##   <fct>        <int>      <dbl>    <dbl>
    ## 1 teaching       102       4.28    0.498
    ## 2 tenure track   108       4.15    0.561
    ## 3 tenured        253       4.14    0.550

    Cool! The boxplot and summary statistics suggest that instructors with a teaching rank get the highest scores while tenured professors get the lowest. However, there is clearly variation around the respective means.

    2.3.2 Fitting a regression with a categorical x

    You’ll now fit a regression model with the categorical variable rank as the explanatory variable and interpret the values in the resulting regression table. Note that here the rank “teaching” is treated as the baseline group for comparison for the “tenure track” and “tenured” groups.

    Fit a linear regression model between score and rank, and then apply the wrapper function to model_score_4 that returns the regression table.

    # Fit regression model
    model_score_4 <- lm(score ~ rank, data = evals)
    
    # Get regression table
    get_regression_table(model_score_4)
    ## # A tibble: 3 × 7
    ##   term               estimate std_error statistic p_value lower_ci upper_ci
    ##   <chr>                 <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
    ## 1 intercept             4.28      0.054     79.9    0        4.18     4.39 
    ## 2 rank: tenure track   -0.13      0.075     -1.73   0.084   -0.277    0.017
    ## 3 rank: tenured        -0.145     0.064     -2.28   0.023   -0.27    -0.02

    Based on the regression table, compute the 3 possible fitted values ŷ, which are the group means. Since “teaching” is the baseline group for comparison, the intercept is the mean score for the “teaching” group, and the rank: tenure track and rank: tenured estimates are offsets relative to this baseline for the “tenure track” and “tenured” groups.

    # teaching mean
    teaching_mean <- 4.28
    
    # tenure track mean
    tenure_track_mean <- 4.28 - 0.130
    
    # tenured mean
    tenured_mean <- 4.28 - 0.145

    Kudos! Remember that regressions with a categorical variable return group means expressed relative to a baseline for comparison!
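
    You can verify that these three fitted values are (up to rounding) exactly the group means you computed in the earlier EDA; a quick check:

    # The regression-based means should match the group means of score
    evals %>%
      group_by(rank) %>%
      summarize(mean_score = mean(score))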

    2.4 Predicting teaching score using gender

    2.4.1 Making predictions using rank

    Run get_regression_table(model_score_4) in the console to regenerate the regression table where you modeled score as a function of rank. Now say using this table you want to predict the teaching score of an instructor about whom you know nothing except that they are a tenured professor. Which of these statements is true?

    • A good prediction of their score would be 4.28 - 0.145 = 4.135.
    • A good prediction of their score would be -0.145.
    • There is no information in the table that can aid your prediction.

    Yay! Regression tables for categorical explanatory variables show differences in means relative to a baseline.

    2.4.2 Visualizing the distribution of residuals

    Let’s now compute both the predicted score ŷ and the residual y − ŷ for all n = 463 instructors in the evals dataset. Furthermore, you’ll plot a histogram of the residuals and see if there are any patterns to the residuals, i.e. your predictive errors.

    model_score_4 from the previous exercise is available in your workspace.

    • Apply the function that automates making predictions and computing residuals, and save these values to the dataframe model_score_4_points.
    # Calculate predictions and residuals
    model_score_4_points <- get_regression_points(model_score_4)
    model_score_4_points
    ## # A tibble: 463 × 5
    ##       ID score rank         score_hat residual
    ##    <int> <dbl> <fct>            <dbl>    <dbl>
    ##  1     1   4.7 tenure track      4.16    0.545
    ##  2     2   4.1 tenure track      4.16   -0.055
    ##  3     3   3.9 tenure track      4.16   -0.255
    ##  4     4   4.8 tenure track      4.16    0.645
    ##  5     5   4.6 tenured           4.14    0.461
    ##  6     6   4.3 tenured           4.14    0.161
    ##  7     7   2.8 tenured           4.14   -1.34 
    ##  8     8   4.1 tenured           4.14   -0.039
    ##  9     9   3.4 tenured           4.14   -0.739
    ## 10    10   4.5 tenured           4.14    0.361
    ## # … with 453 more rows

    Now take the model_score_4_points dataframe to plot a histogram of the residual column so you can see the distribution of the residuals, i.e., the prediction errors.

    # Plot residuals
    ggplot(model_score_4_points, aes(x = residual)) +
      geom_histogram() +
      labs(x = "residuals", title = "Residuals from score ~ rank model")
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    Congrats! Look at the distribution of the residuals. While there seem to be fewer negative residuals (corresponding to overpredictions of score), their magnitudes tend to be larger, ranging all the way down to about -2.

    3 Modeling with Multiple Regression

    In the previous chapter, you learned about basic regression using either a single numerical or a single categorical predictor. But why limit ourselves to using only one variable to inform your explanations/predictions? You will now extend basic regression to multiple regression, which allows for the incorporation of more than one explanatory/predictor variable in your models. You’ll be modeling house prices using a dataset of houses in the Seattle, WA metropolitan area.

    3.1 Explaining house price with year & size

    3.1.1 EDA of relationship

    Unfortunately, making 3D scatterplots to perform an EDA is beyond the scope of this course. So instead let’s focus on making standard 2D scatterplots of the relationship between price and the number of bedrooms, keeping an eye out for outliers.

    The log10 transformations have been made for you and are saved in house_prices.

    Complete the ggplot() code to create a scatterplot of log10_price over bedrooms along with the best-fitting regression line.

    # Create scatterplot with regression line
    ggplot(house_prices, aes(x = bedrooms, y = log10_price)) +
      geom_point() +
      labs(x = "Number of bedrooms", y = "log10 price") +
      geom_smooth(method = "lm", se = FALSE)
    ## `geom_smooth()` using formula 'y ~ x'

    • There is one house that has 33 bedrooms. While this could truly be the case, given the number of bedrooms in the other houses, this is probably an outlier.
    • Remove this outlier using filter() to recreate the plot.
    # Remove outlier
    house_prices_transform <- house_prices %>% 
      filter(bedrooms < 33)
    
    # Create scatterplot with regression line
    ggplot(house_prices_transform, aes(x = bedrooms, y = log10_price)) +
      geom_point() +
      labs(x = "Number of bedrooms", y = "log10 price") +
      geom_smooth(method = "lm", se = FALSE)
    ## `geom_smooth()` using formula 'y ~ x'

    Excellent! Another important reason to perform EDA is to discard any potential outliers that are likely data entry errors. In our case, after removing an outlier, you can see a clear positive relationship between the number of bedrooms and price, as one would expect.
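
    Before discarding a suspected outlier, it is usually worth inspecting it first. A sketch using dplyr (slice_max() assumes dplyr 1.0 or later):

    # Inspect the house with the most bedrooms before removing it
    house_prices %>%
      slice_max(bedrooms, n = 1) %>%
      select(price, bedrooms, sqft_living)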

    3.1.2 Fitting a regression

    house_prices, which is available in your environment, has the log base 10 transformed variables included and the outlier house with 33 bedrooms removed. Let’s fit a multiple regression model of price as a function of size and the number of bedrooms and generate the regression table. In this exercise, you will first fit the model and print the regression table; in the second part, you will use the table to answer a question about the slope for bedrooms.

  • Fit a linear model lm with log10_price as a function of log10_size and bedrooms.
  • # Fit model
    model_price_2 <- lm(log10_price ~ log10_size + bedrooms, 
                        data = house_prices)
  • Print the regression table.
  • # Get regression table
    get_regression_table(model_price_2)
    ## # A tibble: 3 × 7
    ##   term       estimate std_error statistic p_value lower_ci upper_ci
    ##   <chr>         <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
    ## 1 intercept     2.72      0.023     118.        0    2.67     2.76 
    ## 2 log10_size    0.931     0.008     118.        0    0.916    0.947
    ## 3 bedrooms     -0.03      0.002     -19.3       0   -0.033   -0.027

    Which of these interpretations of the slope coefficent for bedrooms is correct?

    • Every extra bedroom is associated with a decrease of on average 0.033 in log10_price.
    • Accounting for log10_size, every extra bedroom is associated with a decrease of on average 0.033 in log10_price.

    Splendid! In this multiple regression setting, the associated effect of any variable must be viewed in light of the other variables in the model. In our case, accounting for the size of the house reverses the relationship of the number of bedrooms and price from positive to negative!
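
    You can see this reversal directly (this check is not part of the exercise) by fitting the simple regression of log10_price on bedrooms alone and comparing its slope to the multiple regression slope above; a sketch, assuming log10_price is in house_prices:

    # Simple regression: slope for bedrooms without accounting for size
    model_bedrooms_only <- lm(log10_price ~ bedrooms, data = house_prices)
    get_regression_table(model_bedrooms_only)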

    3.2 Predicting house price using year & size

    3.2.1 Making predictions using size and bedrooms

    Say you want to predict the price of a house using this model and you know it has:

    • 1000 square feet of living space, and
    • 3 bedrooms

    What is your prediction both in log10 dollars and then dollars?

    The regression model from the previous exercise is available in your workspace as model_price_2.

    get_regression_table(model_price_2)
    # A tibble: 3 x 7
      term       estimate std_error statistic p_value lower_ci upper_ci
      <chr>         <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
    1 intercept     2.69      0.023     116.        0    2.65     2.74 
    2 log10_size    0.941     0.008     118.        0    0.925    0.957
    3 bedrooms     -0.033     0.002     -20.5       0   -0.036   -0.03 
    
    • Use the fitted values of the intercept and slopes from the regression table above to predict this house’s price in log10 dollars.
    # Make prediction in log10 dollars
    2.69 + 0.941 * log10(1000) - 0.033 * 3
    ## [1] 5.414

    Now predict this house’s price in dollars.

    # Make prediction dollars
    10^(2.69 + 0.941 * log10(1000) - 0.033 * 3)
    ## [1] 259417.9

    Spot on! Using the values in the regression table you can make predictions of house prices! In this case, your prediction is about $260,000. Let’s now apply this procedure to all 21k houses!

    3.2.2 Interpreting residuals

    Let’s automate this process for all 21K rows in house_prices to obtain residuals, which you’ll use to compute the sum of squared residuals: a measure of the lack of fit of a model. After computing the sum of squared residuals, you will answer a question about what residuals represent.

    Apply the relevant wrapper function to automate computation of fitted/predicted values and hence also residuals for all 21K houses using model_price_2.

    # Automate prediction and residual computation
    get_regression_points(model_price_2)
    ## # A tibble: 21,613 × 6
    ##       ID log10_price log10_size bedrooms log10_price_hat residual
    ##    <int>       <dbl>      <dbl>    <int>           <dbl>    <dbl>
    ##  1     1        5.35       3.07        3            5.48   -0.139
    ##  2     2        5.73       3.41        3            5.8    -0.069
    ##  3     3        5.26       2.89        2            5.34   -0.087
    ##  4     4        5.78       3.29        4            5.66    0.121
    ##  5     5        5.71       3.22        3            5.63    0.08 
    ##  6     6        6.09       3.73        4            6.07    0.017
    ##  7     7        5.41       3.23        3            5.64   -0.225
    ##  8     8        5.46       3.02        3            5.44    0.024
    ##  9     9        5.36       3.25        3            5.65   -0.29 
    ## 10    10        5.51       3.28        3            5.68   -0.166
    ## # … with 21,603 more rows

    Compute the sum of squared residuals using dplyr commands.

    # Automate prediction and residual computation
    get_regression_points(model_price_2) %>%
      mutate(sq_residuals = residual^2) %>%
      summarize(sum_sq_residuals = sum(sq_residuals))
    ## # A tibble: 1 × 1
    ##   sum_sq_residuals
    ##              <dbl>
    ## 1             605.

    Which of these statements about residuals is incorrect?

    • The residual is the observed outcome variable minus the predicted variable.
    • Residuals are leftover points not accounted for in our regression model.
    • They can be thought of as prediction errors.
    • They can be thought of as the lack-of-fit of the predictions to truth.

    Good job! Residual does suggest ‘leftovers’, but not in the sense that they are leftover points.

    3.3 Explaining house price with size & condition

    3.3.1 Parallel slopes model

    Let’s now fit a “parallel slopes” model with the numerical explanatory/predictor variable log10_size and the categorical, in this case binary, variable waterfront. The visualization corresponding to this model is below:

    Fit a multiple regression of log10_price using log10_size and waterfront as the predictors. Recall that the data frame that contains these variables is house_prices.

    # Fit model
    model_price_4 <- lm(log10_price ~ log10_size + waterfront,
                        data = house_prices)
    
    # Get regression table
    get_regression_table(model_price_4)
    ## # A tibble: 3 × 7
    ##   term           estimate std_error statistic p_value lower_ci upper_ci
    ##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
    ## 1 intercept         2.96      0.02      146.        0    2.92     3.00 
    ## 2 log10_size        0.825     0.006     134.        0    0.813    0.837
    ## 3 waterfrontTRUE    0.322     0.013      24.5       0    0.296    0.348

    Success! Notice how the regression table has three rows: the intercept, the slope for log10_size, and an offset for houses that do have a view of the waterfront.
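
    The parallel slopes visualization referred to above is not reproduced here, but a plot like it can be sketched with moderndive's geom_parallel_slopes(), assuming a version of the package that provides it:

    # Parallel slopes plot: one line per level of waterfront, equal slopes
    ggplot(house_prices, aes(x = log10_size, y = log10_price, color = waterfront)) +
      geom_point(alpha = 0.1) +
      geom_parallel_slopes(se = FALSE) +
      labs(x = "log10 size", y = "log10 price")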

    3.3.2 Interpreting the parallel slopes model

    Let’s interpret the values in the regression table for the parallel slopes model you just fit. Run get_regression_table(model_price_4) in the console to view the regression table again. The visualization for this model is below. Which of these interpretations is incorrect?

    get_regression_table(model_price_4)
    ## # A tibble: 3 × 7
    ##   term           estimate std_error statistic p_value lower_ci upper_ci
    ##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
    ## 1 intercept         2.96      0.02      146.        0    2.92     3.00 
    ## 2 log10_size        0.825     0.006     134.        0    0.813    0.837
    ## 3 waterfrontTRUE    0.322     0.013      24.5       0    0.296    0.348
    • The intercept for houses with a view of the waterfront is 3.282.
    • All houses are assumed to have the same slope between log10_price & log10_size.
    • The intercept for houses without a view of the waterfront is 2.96.
    • The intercept for houses with a view of the waterfront is 0.322.

    Right! 0.322 is the offset in the intercept for houses with a view of the waterfront relative to those which don’t.

    3.4 Predicting house price using size & condition

    3.4.1 Making predictions using size and waterfront

    Using your model for log10_price as a function of log10_size and the binary variable waterfront, let’s make some predictions! Say you have the following two “new” houses; what would you predict their prices to be in dollars?

    • House A: log10_size = 2.9 that has a view of the waterfront
    • House B: log10_size = 3.1 that does not have a view of the waterfront

    We make the corresponding visual predictions below:

    After running the code to get the regression table based on model_price_4, compute the predicted prices for both houses. First you’ll use an equation based on values in this regression table to get a predicted value in log10 dollars, then raise 10 to this predicted value to get a predicted value in dollars.

    # Get regression table
    get_regression_table(model_price_4)
    ## # A tibble: 3 × 7
    ##   term           estimate std_error statistic p_value lower_ci upper_ci
    ##   <chr>             <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
    ## 1 intercept         2.96      0.02      146.        0    2.92     3.00 
    ## 2 log10_size        0.825     0.006     134.        0    0.813    0.837
    ## 3 waterfrontTRUE    0.322     0.013      24.5       0    0.296    0.348
    # Prediction for House A
    10^(2.96 + 0.825 * 2.9 + 0.322)
    ## [1] 472606.8
    # Prediction for House B
    10^(2.96 + 0.825 * 3.1)
    ## [1] 329230.5

    Yay! Your modeling toolbox is getting quite extensive! Let’s now automate this!

    3.4.2 Automating predictions on "new" houses

    Let’s now repeat what you did in the last exercise, but in an automated fashion assuming the information on these “new” houses is saved in a dataframe.

    Your model for log10_price as a function of log10_size and the binary variable waterfront (model_price_4) is available in your workspace, and so is new_houses_2, a dataframe with data on 2 new houses. While not so beneficial with only 2 “new” houses, this will save a lot of work if you had 2000 “new” houses.

    Apply get_regression_points() as you would normally, but with the newdata argument set to our two “new” houses. This returns predicted values for just those houses.

    # Setup: define the two "new" houses
    new_houses_2 <- tibble(log10_size = c(2.9, 3.1),
                           waterfront = c(TRUE, FALSE))

    # View the "new" houses
    new_houses_2
    ## # A tibble: 2 × 2
    ##   log10_size waterfront
    ##        <dbl> <lgl>     
    ## 1        2.9 TRUE      
    ## 2        3.1 FALSE
    # Get predictions on "new" houses
    get_regression_points(model_price_4, newdata = new_houses_2)
    ## # A tibble: 2 × 4
    ##      ID log10_size waterfront log10_price_hat
    ##   <int>      <dbl> <lgl>                <dbl>
    ## 1     1        2.9 TRUE                  5.67
    ## 2     2        3.1 FALSE                 5.52

    Now take these two predictions in log10_price_hat and return a new column, price_hat, consisting of fitted price in dollars.

    # Get predictions price_hat in dollars on "new" houses
    get_regression_points(model_price_4, newdata = new_houses_2) %>% 
      mutate(price_hat = 10^log10_price_hat)
    ## # A tibble: 2 × 5
    ##      ID log10_size waterfront log10_price_hat price_hat
    ##   <int>      <dbl> <lgl>                <dbl>     <dbl>
    ## 1     1        2.9 TRUE                  5.67   472063.
    ## 2     2        3.1 FALSE                 5.52   328095.

    Predictions of $472,000 and $328,000! Exceptional! You’re done with the multiple regression chapter, and now you’re onto model assessment and selection!

    4 Model Assessment and Selection

    In the previous chapters, you fit various models to explain or predict an outcome variable of interest. However, how do we know which models to choose? Model assessment measures allow you to assess how well an explanatory model “fits” a set of data or how accurate a predictive model is. Based on these measures, you’ll learn about criteria for determining which models are “best”.

    4.1 Model selection and assessment

    4.1.1 Refresher: sum of squared residuals

    Let’s remind you how to compute the sum of squared residuals. You’ll do this for two models.

  • Use the appropriate function to get a dataframe with the residuals for model_price_2.
  • # Model 2
    model_price_2 <- lm(log10_price ~ log10_size + bedrooms, 
                        data = house_prices)
  • Add a new column of squared residuals called sq_residuals.
  • Then summarize sq_residuals with their sum. Call this sum sum_sq_residuals.
  • # Calculate squared residuals
    get_regression_points(model_price_2) %>%
      mutate(sq_residuals = residual^2) %>%
      summarize(sum_sq_residuals = sum(sq_residuals))
    ## # A tibble: 1 × 1
    ##   sum_sq_residuals
    ##              <dbl>
    ## 1             605.

    Compute the sum of squared residuals for model_price_4 which uses the categorical variable waterfront instead of the numerical variable bedrooms.

    # Model 4
    model_price_4 <- lm(log10_price ~ log10_size + waterfront, 
                        data = house_prices)
    
    # Calculate squared residuals
    get_regression_points(model_price_4) %>%
      mutate(sq_residuals = residual^2) %>%
      summarize(sum_sq_residuals = sum(sq_residuals))
    ## # A tibble: 1 × 1
    ##   sum_sq_residuals
    ##              <dbl>
    ## 1             599.

    Wonderful! Let’s use these two measures of model assessment to choose between these two models, or in other words, perform model selection!

    4.1.2 Which model to select?

    Based on these two values of the sum of squared residuals, which of these two models do you think is “better”, and hence which would you select?

    • model_price_2 that uses log10_size and bedrooms?
    • model_price_4 that uses log10_size and waterfront?
    • No information about which is better is provided whatsoever.
    • Since model_price_2’s value was 605, select this one.
    • Since model_price_4’s value was 599, select this one.

    Correct! Given the choice of just these two models, the evidence suggests that using size and waterfront yields a better fit, so you should choose this one!
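
    To avoid repeating the same pipeline for every candidate model, you could wrap it in a small helper function; a sketch (the function name sum_sq_residuals() is illustrative, not part of the course):

    # Helper: compute the sum of squared residuals for any fitted model
    sum_sq_residuals <- function(model) {
      get_regression_points(model) %>%
        mutate(sq_residuals = residual^2) %>%
        summarize(sum_sq_residuals = sum(sq_residuals))
    }

    # Compare the two candidate models
    sum_sq_residuals(model_price_2)
    sum_sq_residuals(model_price_4)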

    4.2 Assessing model fit with R-squared

    4.2.1 Computing the R-squared of a model

    Let’s compute the R2 summary value for the two numerical explanatory/predictor variable model you fit in Chapter 3: price as a function of size and the number of bedrooms.

    Recall that R2 can be calculated as:

    R2 = 1 − Var(residuals) / Var(y)

    Compute R2 by summarizing the residual and log10_price columns.

    # Fit model
    model_price_2 <- lm(log10_price ~ log10_size + bedrooms,
                        data = house_prices)
                        
    # Get fitted/values & residuals, compute R^2 using residuals
    get_regression_points(model_price_2) %>%
      summarize(r_squared = 1 - var(residual) / var(log10_price))
    ## # A tibble: 1 × 1
    ##   r_squared
    ##       <dbl>
    ## 1     0.465

    Nice job! You observed an R-squared value of 0.465, which means that 46.5% of the total variability of the outcome variable log base 10 price can be explained by this model.
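
    As a sanity check (not part of the exercise), base R reports the same quantity for a fitted lm object:

    # Cross-check: R-squared as reported by summary()
    summary(model_price_2)$r.squared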

    4.2.2 Comparing the R-squared of two models

    Let’s now compute R2 for the one numerical and one categorical explanatory/predictor variable model you fit in Chapter 3: price as a function of size and whether the house has a view of the waterfront.

    Compute R2 for model_price_4.

    # Fit model
    model_price_4 <- lm(log10_price ~ log10_size + waterfront,
                        data = house_prices)
    
    # Get fitted/values & residuals, compute R^2 using residuals
    get_regression_points(model_price_4) %>%
      summarize(r_squared = 1 - var(residual) / var(log10_price))
    ## # A tibble: 1 × 1
    ##   r_squared
    ##       <dbl>
    ## 1     0.470

    And compare its R2 with the one you just computed.

    • Since model_price_2 had a lower R2 of 0.465, it “fit” the data better.
    • Since model_price_4 had a higher R2 of 0.470, it “fit” the data better.
    • R2 doesn’t tell us anything about quality of model “fit”.

    Correct! Since using waterfront explained a higher proportion of the total variance of the outcome variable than using the number of bedrooms, using waterfront in our model is preferred.

    4.3 Assessing predictions with RMSE

    4.3.1 Computing the MSE & RMSE of a model

    Just as you did earlier with R2, which is a measure of model fit, let’s now compute the root mean squared error (RMSE) of our models, which is a commonly used measure of predictive error. Let’s use the model of price as a function of size and the number of bedrooms.

    The model is available in your workspace as model_price_2.

    Let’s start by computing the mean squared error (mse), which is the mean of the squared residuals.

    # Get all residuals, square them, and take mean                    
    get_regression_points(model_price_2) %>%
      mutate(sq_residuals = residual^2) %>%
      summarize(mse = mean(sq_residuals))
    ## # A tibble: 1 × 1
    ##      mse
    ##    <dbl>
    ## 1 0.0280

    Now that you’ve computed the mean squared error, let’s compute the root mean squared error.

    # Get all residuals, square them, take the mean and square root               
    get_regression_points(model_price_2) %>%
      mutate(sq_residuals = residual^2) %>%
      summarize(mse = mean(sq_residuals)) %>%
      mutate(rmse = sqrt(mse))
    ## # A tibble: 1 × 2
    ##      mse  rmse
    ##    <dbl> <dbl>
    ## 1 0.0280 0.167

    Woo hoo! The RMSE is 0.167. You can think of this as the “typical” prediction error this model makes.
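
    Equivalently (though not required here), you can compute the RMSE in one line from the fitted model's residuals:

    # One-line RMSE from the model's residuals
    sqrt(mean(residuals(model_price_2)^2))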

    4.3.2 Comparing the RMSE of two models

    As you did using the sum of squared residuals and R2, let’s once again assess the quality of the two models, this time using the RMSE. Note that RMSE is more typically used in prediction settings than in explanatory settings.

    model_price_2 and model_price_4 are available in your workspace.

    Based on the code provided that computes MSE and RMSE for model_price_2, compute the MSE and RMSE for model_price_4.

    # MSE and RMSE for model_price_2
    get_regression_points(model_price_2) %>%
      mutate(sq_residuals = residual^2) %>%
      summarize(mse = mean(sq_residuals), rmse = sqrt(mean(sq_residuals)))
    ## # A tibble: 1 × 2
    ##      mse  rmse
    ##    <dbl> <dbl>
    ## 1 0.0280 0.167
    # MSE and RMSE for model_price_4
    get_regression_points(model_price_4) %>%
      mutate(sq_residuals = residual^2) %>%
      summarize(mse = mean(sq_residuals), rmse = sqrt(mean(sq_residuals)))
    ## # A tibble: 1 × 2
    ##      mse  rmse
    ##    <dbl> <dbl>
    ## 1 0.0277 0.166

    And compare the quality of your two models using the root mean squared error (RMSE).

    • Since model_price_2 had a higher rmse of 0.167, this suggests that this model has better predictive power.
    • rmse doesn’t tell us anything about predictive power.
    • Since model_price_4 had a lower rmse of 0.166, this suggests that this model has better predictive power.

    Correct! Since model_price_4 had a lower rmse of 0.166, this suggests it has slightly better predictive power.

    4.4 Validation set prediction framework

    4.4.1 Fitting model to training data

    It’s time to split your data into a training set to fit a model and a separate test set to evaluate the predictive power of the model. Before making this split however, we first sample 100% of the rows of house_prices without replacement and assign this to house_prices_shuffled. This has the effect of “shuffling” the rows, thereby ensuring that the training and test sets are randomly sampled.

    • Use slice() to set train to the first 10,000 rows of house_prices_shuffled and test to the remainder of the 21,613 rows.
    # Set random number generator seed value for reproducibility
    set.seed(76)
    
    # Randomly reorder the rows
    house_prices_shuffled <- house_prices %>% 
      sample_frac(size = 1, replace = FALSE)
    
    # Train/test split
    train <- house_prices_shuffled %>%
      slice(1:10000)
    test <- house_prices_shuffled %>%
      slice(10001:21613)

    Now fit a linear regression to predict log10_price using log10_size and bedrooms using just the training data.

    # Fit model to training set
    train_model_2 <- lm(log10_price ~ log10_size + bedrooms, data = train)

    Fabulous! Since you’ve fit/trained the predictive model on the training set, let’s now apply it to the test set to make predictions!

    4.4.2 Predicting on test data

    Now that you’ve trained the model on the train set, let’s apply the model to the test data, make predictions, and evaluate the predictions. Recall that having a separate test set here simulates the gathering of a “new” independent dataset to test our model’s predictive performance on.

    The datasets train and test, and the trained model, train_model_2 are available in your workspace.

    • Use the get_regression_points() function to apply train_model_2 on your new dataset: test.
    # Make predictions on test set
    get_regression_points(train_model_2, newdata = test)
    ## # A tibble: 11,613 × 6
    ##       ID log10_price log10_size bedrooms log10_price_hat residual
    ##    <int>       <dbl>      <dbl>    <int>           <dbl>    <dbl>
    ##  1     1        5.81       3.24        3            5.64    0.171
    ##  2     2        5.45       3.25        3            5.65   -0.204
    ##  3     3        5.69       3.38        3            5.78   -0.089
    ##  4     4        5.50       2.98        3            5.40    0.106
    ##  5     5        5.93       3.55        4            5.90    0.031
    ##  6     6        5.39       3.25        3            5.65   -0.267
    ##  7     7        5.61       3.06        2            5.51    0.101
    ##  8     8        5.38       3.12        2            5.56   -0.182
    ##  9     9        6.28       3.49        4            5.84    0.44 
    ## 10    10        5.85       3.53        3            5.91   -0.058
    ## # … with 11,603 more rows

    Compute the root mean square error using this output.

    # Compute RMSE
    get_regression_points(train_model_2, newdata = test) %>% 
      mutate(sq_residuals = residual^2) %>%
      summarize(rmse = sqrt(mean(sq_residuals)))
    ## # A tibble: 1 × 1
    ##    rmse
    ##   <dbl>
    ## 1 0.167

    Magnificent! Your test set RMSE using size and the number of bedrooms as predictors is 0.167. You can now compare this against the test set RMSE of any competing model fit to the same training data; the model with the smaller test set RMSE has the better predictive performance.
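
    As an illustration of such a comparison (not part of the original exercise), this sketch fits the size-and-waterfront model from Chapter 3 on the same training set and computes its test set RMSE; the model with the lower value would be preferred:

    # Fit a competing model on the same training set
    train_model_4 <- lm(log10_price ~ log10_size + waterfront, data = train)

    # Compute its RMSE on the same test set
    get_regression_points(train_model_4, newdata = test) %>%
      mutate(sq_residuals = residual^2) %>%
      summarize(rmse = sqrt(mean(sq_residuals)))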