Transformed model

As you saw in the previous chapter, transforming the variables can often turn a model where the technical conditions are violated into one where the technical conditions hold. When the technical conditions hold, you can accurately interpret the inferential output. In the two models below, note how the standard errors and p-values change (although in both settings the p-value is significant).

* Run a linear regression on price versus bed for the LAhomes dataset, then tidy the output.
* Do the same on log-transformed variables: log(price) versus log(bed).

> glimpse(LAhomes)
Rows: 1,582
Columns: 9
$ city   "Long Beach", "Long Beach", "Long Beach", "Long Beach", "Long B…
$ type   NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bed    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ bath   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, …
$ garage NA, NA, NA, NA, NA, NA, NA, NA, "1", "1", "1", "1", "1", "2", "…
$ sqft   552, 558, 596, 744, 750, 750, 791, 798, 671, 796, 935, 1006, 14…
$ pool   NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ spa    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ price  159900, 135000, 105000, 167000, 134900, 145000, 219900, 195000,…

# Create a tidy model
lm(price ~ bed, data = LAhomes) %>%
  tidy()

# A tibble: 2 × 5
  term        estimate  std.error statistic   p.value
1 (Intercept) -3104293.   170451.     -18.2 2.00e- 67
2 bed          1573313.    55438.      28.4 1.60e-143

# Create a tidy model using the log of both variables
lm(log(price) ~ log(bed), data = LAhomes) %>%
  tidy()

# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
1 (Intercept)    12.0     0.0448      268.  0
2 log(bed)        1.41    0.0436       32.3 2.04e-176

Notice that the estimate for the size of the effect of bed on price and the p-value are both wildly different between the two models.
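The change in scale also changes what the slope means: in the log-log model, the coefficient on log(bed) is an elasticity, so a 1% increase in bed is associated with roughly a 1.4% increase in price. A minimal sketch with simulated data (not LAhomes; the "true" elasticity of 1.4 is invented to echo the fitted value above):

```r
# Simulated sketch: generate prices whose log is linear in log(bed),
# then compare the raw-scale and log-scale fits.
set.seed(1)
bed   <- sample(1:6, 500, replace = TRUE)
price <- exp(12 + 1.4 * log(bed) + rnorm(500, sd = 0.5))  # true elasticity 1.4

fit_raw <- lm(price ~ bed)            # slope in dollars per bedroom
fit_log <- lm(log(price) ~ log(bed))  # slope is a unitless elasticity

coef(fit_log)[["log(bed)"]]  # recovers roughly 1.4
```

Because the noise here is added on the log scale (so it is multiplicative on the raw scale), the raw-scale fit violates the equal-variance condition while the log-scale fit satisfies it, mirroring the contrast in the exercise.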
--------------------------------------------------------------------------------------------------------------------------------------------

LA Homes, multicollinearity (1)

In the next series of exercises, you will investigate how to interpret the sign (positive or negative) of the slope coefficient as well as the significance of the variables (p-value). You will continue to use the log-transformed variables so that the technical conditions hold, but you will not be concerned here with the value of the coefficient.

* Run a linear regression on log price versus log sqft for the LAhomes dataset, then tidy the output.
* Look at the output. Is the relationship positive or negative? Is the relationship significant?

# Output the tidy model
lm(log(price) ~ log(sqft), data = LAhomes) %>%
  tidy()

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
1 (Intercept)     2.70    0.144       18.8 1.97e-71
2 log(sqft)       1.44    0.0195      73.8 0

The model shows that the area of the home has a significant positive relationship with the price.

--------------------------------------------------------------------------------------------------------------------------------------------

LA Homes, multicollinearity (2)

Repeat the previous exercise, but this time regress the log-transformed variable price on the new variable bath, which records the number of bathrooms in a home.

* Run a linear regression on log price versus log bath for the LAhomes dataset, then tidy the output.
* Look at the output. Is the relationship positive or negative? Is the relationship significant?

# Output the tidy model
lm(log(price) ~ log(bath), data = LAhomes) %>%
  tidy()

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
1 (Intercept)    12.2     0.0280      437. 0
2 log(bath)       1.43    0.0306       46.6 9.66e-300

The model shows that the number of bathrooms also has a significant positive relationship with the price of the home.
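Before combining sqft and bath in one model, it helps to ask how much the two predictors overlap. A sketch with simulated stand-ins (the variable names mimic LAhomes, but every number below is invented):

```r
# Bathrooms tend to increase with square footage, so simulate bath as a
# noisy, rounded function of sqft and measure the overlap on the log scale.
set.seed(2)
sqft <- exp(rnorm(1000, mean = 7.3, sd = 0.5))
bath <- pmax(1, round(sqft / 1000 + rnorm(1000, sd = 0.4)))

# A strong positive correlation warns that the two predictors will
# compete for the same explanatory role when entered together.
cor(log(sqft), log(bath))
```

When two predictors are this correlated, each can act as a stand-in for the other, which is exactly the situation the next exercise explores.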
--------------------------------------------------------------------------------------------------------------------------------------------

LA Homes, multicollinearity (3)

Now, regress the log-transformed variable price on the log-transformed variables sqft AND bath. The model is a three-dimensional linear regression model where you are predicting price as a plane (think of a piece of paper) above the axes for both sqft and bath.

* Run a tidy lm on the log-transformed variables price and both of sqft and bath from the dataset LAhomes. Use the formula: log(price) ~ log(sqft) + log(bath).
* Now look at the coefficients separately. What happened to the sign of each coefficient? What happened to the significance of each coefficient?

# Output the tidy model
lm(log(price) ~ log(sqft) + log(bath), data = LAhomes) %>%
  tidy()

# A tibble: 3 × 5
  term        estimate std.error statistic   p.value
1 (Intercept)   2.51      0.262      9.60  2.96e- 21
2 log(sqft)     1.47      0.0395    37.2   1.19e-218
3 log(bath)    -0.0390    0.0453    -0.862 3.89e-  1

The model still predicts that the square footage of the home has a significant positive relationship with price. However, the number of bathrooms no longer has a significant relationship with the price. (Note also that the sign of its coefficient has changed.)

--------------------------------------------------------------------------------------------------------------------------------------------

Inference on coefficients

Using the NYC Italian restaurants dataset restNYC (compiled by Simon Sheather in A Modern Approach to Regression with R), you will investigate the effect on the significance of the coefficients when there are multiple variables in the model. Recall, the p-value associated with any coefficient is the probability of the observed data given that the particular variable is independent of the response AND given that all other variables are included in the model.
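The vanishing significance (and the sign flip) can be reproduced in a small simulation where price truly depends only on sqft and bath merely rides along with sqft. All numbers below are invented for illustration:

```r
# bath is correlated with sqft but plays no direct role in price.
set.seed(3)
n     <- 1000
sqft  <- exp(rnorm(n, mean = 7.3, sd = 0.5))
bath  <- pmax(1, round(sqft / 1000 + rnorm(n, sd = 0.4)))  # tracks sqft
price <- exp(2.7 + 1.44 * log(sqft) + rnorm(n, sd = 0.4))  # bath absent

# Alone, bath proxies for sqft and looks strongly significant:
p_alone <- summary(lm(log(price) ~ log(bath)))$coefficients["log(bath)", "Pr(>|t|)"]

# Next to sqft, bath has nothing left to explain:
p_both <- summary(lm(log(price) ~ log(sqft) + log(bath)))$coefficients["log(bath)", "Pr(>|t|)"]

c(alone = p_alone, with_sqft = p_both)
```

The simple regression on bath is not wrong; it just answers a different question, because its coefficient absorbs everything bath is correlated with, including sqft.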
The following information relates to the dataset restNYC, which is loaded into your workspace:

+ each row represents one customer survey from Italian restaurants in NYC
+ Price = price (in US$) of dinner (including tip and one drink)
+ Service = rating of the service (from 1 to 30)
+ Food = rating of the food (from 1 to 30)
+ Decor = rating of the decor (from 1 to 30)

* Run a tidy lm regressing Price on Service.
* Run a tidy lm regressing Price on Service, Food, and Decor.
* What happened to the significance of Service when additional variables were added to the model?

> glimpse(restNYC)
Rows: 168
Columns: 7
$ Case       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ Restaurant "Daniella Ristorante", "Tello's Ristorante", "Biricchino", …
$ Price      43, 32, 34, 41, 54, 52, 34, 34, 39, 44, 45, 47, 52, 35, 47,…
$ Food       22, 20, 21, 20, 24, 22, 22, 20, 22, 21, 19, 21, 21, 19, 20,…
$ Decor      18, 19, 13, 20, 19, 22, 16, 18, 19, 17, 17, 19, 19, 17, 18,…
$ Service    20, 19, 18, 17, 21, 21, 21, 21, 22, 19, 20, 21, 20, 19, 21,…
$ East       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

# Output the first model
lm(Price ~ Service, data = restNYC) %>%
  tidy()

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
1 (Intercept)   -12.0      5.11      -2.34 2.02e- 2
2 Service         2.82     0.262     10.8  7.88e-21

# Output the second model
lm(Price ~ Service + Food + Decor, data = restNYC) %>%
  tidy()

# A tibble: 4 × 5
  term        estimate std.error statistic  p.value
1 (Intercept)  -24.6       4.75     -5.18  6.33e- 7
2 Service        0.135     0.396     0.341 7.33e- 1
3 Food           1.56      0.373     4.17  4.93e- 5
4 Decor          1.85      0.218     8.49  1.17e-14

When only Service is included in the model, it appears significant. However, once Food and Decor are added to the model, that is no longer the case. What is the correct interpretation of the coefficient on Service in the linear model which regresses Price on Service, Food, and Decor?
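Notice that Service's standard error grows from 0.262 alone to 0.396 alongside Food and Decor; that growth is what multicollinearity looks like numerically. One standard diagnostic is the variance inflation factor: regress a predictor on the other predictors, and 1/(1 - R²) says how much that coefficient's variance is inflated. A sketch with simulated ratings (invented numbers standing in for restNYC):

```r
# Ratings that share a latent "overall quality" factor, so Food, Decor,
# and Service are all positively correlated with one another.
set.seed(4)
n       <- 200
quality <- rnorm(n)
Food    <- 20 + 2 * quality + rnorm(n)
Decor   <- 18 + 2 * quality + rnorm(n)
Service <- 19 + 2 * quality + rnorm(n)

# Variance inflation factor for Service: how much of Service do the
# other predictors already explain?
r2  <- summary(lm(Service ~ Food + Decor))$r.squared
vif <- 1 / (1 - r2)
vif  # well above 1, so Service's standard error grows accordingly
```

(The car package's vif() computes the same quantity directly from a fitted model.)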
Given that Food and Decor are in the model, Service is not significant, and we cannot know whether it has an effect on modeling Price.