Transformed model

As you saw in the previous chapter, transforming the variables can often turn a model where the technical conditions are violated into one where the technical conditions hold. When the technical conditions hold, you can accurately interpret the inferential output. In the two models below, note how the standard errors and p-values change (although in both settings the p-value is significant).

* Run a linear regression on price versus bed for the LAhomes dataset, then tidy the output.
* Do the same on log-transformed variables: log(price) versus log(bed).

> glimpse(LAhomes)
Rows: 1,582
Columns: 9
$ city   "Long Beach", "Long Beach", "Long Beach", "Long Beach", "Long B…
$ type   NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ bed    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ bath   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, …
$ garage NA, NA, NA, NA, NA, NA, NA, NA, "1", "1", "1", "1", "1", "2", "…
$ sqft   552, 558, 596, 744, 750, 750, 791, 798, 671, 796, 935, 1006, 14…
$ pool   NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ spa    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ price  159900, 135000, 105000, 167000, 134900, 145000, 219900, 195000,…

# Create a tidy model
lm(price ~ bed, data = LAhomes) %>%
  tidy()

# A tibble: 2 × 5
  term        estimate  std.error statistic   p.value
1 (Intercept) -3104293.   170451.     -18.2 2.00e- 67
2 bed          1573313.    55438.      28.4 1.60e-143

# Create a tidy model using the log of both variables
lm(log(price) ~ log(bed), data = LAhomes) %>%
  tidy()

# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
1 (Intercept)    12.0     0.0448      268.  0
2 log(bed)        1.41    0.0436       32.3 2.04e-176

Notice that the estimate for the size of the effect of bed on price and the p-value are both wildly different between the two models.
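The change in scale also changes what the slope means: in the log-log model, the coefficient on log(bed) is an elasticity, so a 1% increase in bed is associated with roughly a 1.4% increase in price. A minimal sketch with simulated data (not LAhomes; the "true" elasticity of 1.4 is invented to echo the fitted value above):

```r
# Simulated sketch: generate prices whose log is linear in log(bed),
# then compare the raw-scale and log-scale fits.
set.seed(1)
bed   <- sample(1:6, 500, replace = TRUE)
price <- exp(12 + 1.4 * log(bed) + rnorm(500, sd = 0.5))  # true elasticity 1.4

fit_raw <- lm(price ~ bed)            # slope in dollars per bedroom
fit_log <- lm(log(price) ~ log(bed))  # slope is a unitless elasticity

coef(fit_log)[["log(bed)"]]  # recovers roughly 1.4
```

Because the noise here is added on the log scale (so it is multiplicative on the raw scale), the raw-scale fit violates the equal-variance condition while the log-scale fit satisfies it, mirroring the contrast in the exercise.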
--------------------------------------------------------------------------------------------------------------------------------------------

LA Homes, multicollinearity (1)

In the next series of exercises, you will investigate how to interpret the sign (positive or negative) of the slope coefficient as well as the significance of the variables (p-value). You will continue to use the log-transformed variables so that the technical conditions hold, but you will not be concerned here with the value of the coefficient.

* Run a linear regression on log price versus log sqft for the LAhomes dataset, then tidy the output.
* Look at the output. Is the relationship positive or negative? Is the relationship significant?

# Output the tidy model
lm(log(price) ~ log(sqft), data = LAhomes) %>%
  tidy()

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
1 (Intercept)     2.70    0.144       18.8 1.97e-71
2 log(sqft)       1.44    0.0195      73.8 0

The model shows that the area of the home has a significant positive relationship with the price.

--------------------------------------------------------------------------------------------------------------------------------------------

LA Homes, multicollinearity (2)

Repeat the previous exercise, but this time regress the log-transformed variable price on the new variable bath, which records the number of bathrooms in a home.

* Run a linear regression on log price versus log bath for the LAhomes dataset, then tidy the output.
* Look at the output. Is the relationship positive or negative? Is the relationship significant?

# Output the tidy model
lm(log(price) ~ log(bath), data = LAhomes) %>%
  tidy()

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
1 (Intercept)    12.2     0.0280      437. 0
2 log(bath)       1.43    0.0306       46.6 9.66e-300

The model shows that the number of bathrooms also has a significant positive relationship with the price of the home.
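Before combining sqft and bath in one model, it helps to ask how much the two predictors overlap. A sketch with simulated stand-ins (the variable names mimic LAhomes, but every number below is invented):

```r
# Bathrooms tend to increase with square footage, so simulate bath as a
# noisy, rounded function of sqft and measure the overlap on the log scale.
set.seed(2)
sqft <- exp(rnorm(1000, mean = 7.3, sd = 0.5))
bath <- pmax(1, round(sqft / 1000 + rnorm(1000, sd = 0.4)))

# A strong positive correlation warns that the two predictors will
# compete for the same explanatory role when entered together.
cor(log(sqft), log(bath))
```

When two predictors are this correlated, each can act as a stand-in for the other, which is exactly the situation the next exercise explores.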
--------------------------------------------------------------------------------------------------------------------------------------------

LA Homes, multicollinearity (3)

Now, regress the log-transformed variable price on the log-transformed variables sqft AND bath. The model is a three-dimensional linear regression model where you are predicting price as a plane (think of a piece of paper) above the axes for both sqft and bath.

* Run a tidy lm on the log-transformed variables price and both of sqft and bath from the dataset LAhomes. Use the formula: log(price) ~ log(sqft) + log(bath).
* Now look at the coefficients separately. What happened to the sign of each coefficient? What happened to the significance of each coefficient?

# Output the tidy model
lm(log(price) ~ log(sqft) + log(bath), data = LAhomes) %>%
  tidy()

# A tibble: 3 × 5
  term        estimate std.error statistic   p.value
1 (Intercept)   2.51      0.262      9.60  2.96e- 21
2 log(sqft)     1.47      0.0395    37.2   1.19e-218
3 log(bath)    -0.0390    0.0453    -0.862 3.89e-  1

The model still predicts that the square footage of the home has a significant positive relationship with price. However, the number of bathrooms no longer has a significant relationship with the price. (Note also that the sign of its coefficient has changed.)

--------------------------------------------------------------------------------------------------------------------------------------------

Inference on coefficients

Using the NYC Italian restaurants dataset restNYC (compiled by Simon Sheather in A Modern Approach to Regression with R), you will investigate the effect on the significance of the coefficients when there are multiple variables in the model. Recall, the p-value associated with any coefficient is the probability of the observed data given that the particular variable is independent of the response AND given that all other variables are included in the model.
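The vanishing significance (and the sign flip) can be reproduced in a small simulation where price truly depends only on sqft and bath merely rides along with sqft. All numbers below are invented for illustration:

```r
# bath is correlated with sqft but plays no direct role in price.
set.seed(3)
n     <- 1000
sqft  <- exp(rnorm(n, mean = 7.3, sd = 0.5))
bath  <- pmax(1, round(sqft / 1000 + rnorm(n, sd = 0.4)))  # tracks sqft
price <- exp(2.7 + 1.44 * log(sqft) + rnorm(n, sd = 0.4))  # bath absent

# Alone, bath proxies for sqft and looks strongly significant:
p_alone <- summary(lm(log(price) ~ log(bath)))$coefficients["log(bath)", "Pr(>|t|)"]

# Next to sqft, bath has nothing left to explain:
p_both <- summary(lm(log(price) ~ log(sqft) + log(bath)))$coefficients["log(bath)", "Pr(>|t|)"]

c(alone = p_alone, with_sqft = p_both)
```

The simple regression on bath is not wrong; it just answers a different question, because its coefficient absorbs everything bath is correlated with, including sqft.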
The following information relates to the dataset restNYC, which is loaded into your workspace:

+ each row represents one customer survey from Italian restaurants in NYC
+ Price = price (in US$) of dinner (including tip and one drink)
+ Service = rating of the service (from 1 to 30)
+ Food = rating of the food (from 1 to 30)
+ Decor = rating of the decor (from 1 to 30)

* Run a tidy lm regressing Price on Service.
* Run a tidy lm regressing Price on Service, Food, and Decor.
* What happened to the significance of Service when additional variables were added to the model?

> glimpse(restNYC)
Rows: 168
Columns: 7
$ Case       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ Restaurant "Daniella Ristorante", "Tello's Ristorante", "Biricchino", …
$ Price      43, 32, 34, 41, 54, 52, 34, 34, 39, 44, 45, 47, 52, 35, 47,…
$ Food       22, 20, 21, 20, 24, 22, 22, 20, 22, 21, 19, 21, 21, 19, 20,…
$ Decor      18, 19, 13, 20, 19, 22, 16, 18, 19, 17, 17, 19, 19, 17, 18,…
$ Service    20, 19, 18, 17, 21, 21, 21, 21, 22, 19, 20, 21, 20, 19, 21,…
$ East       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

# Output the first model
lm(Price ~ Service, data = restNYC) %>%
  tidy()

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
1 (Intercept)   -12.0      5.11      -2.34 2.02e- 2
2 Service         2.82     0.262     10.8  7.88e-21

# Output the second model
lm(Price ~ Service + Food + Decor, data = restNYC) %>%
  tidy()

# A tibble: 4 × 5
  term        estimate std.error statistic  p.value
1 (Intercept)  -24.6       4.75     -5.18  6.33e- 7
2 Service        0.135     0.396     0.341 7.33e- 1
3 Food           1.56      0.373     4.17  4.93e- 5
4 Decor          1.85      0.218     8.49  1.17e-14

When only Service is included in the model, it appears significant. However, once Food and Decor are added to the model, that is no longer the case. What is the correct interpretation of the coefficient on Service in the linear model which regresses Price on Service, Food, and Decor?
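Notice that Service's standard error grows from 0.262 alone to 0.396 alongside Food and Decor; that growth is what multicollinearity looks like numerically. One standard diagnostic is the variance inflation factor: regress a predictor on the other predictors, and 1/(1 - R²) says how much that coefficient's variance is inflated. A sketch with simulated ratings (invented numbers standing in for restNYC):

```r
# Ratings that share a latent "overall quality" factor, so Food, Decor,
# and Service are all positively correlated with one another.
set.seed(4)
n       <- 200
quality <- rnorm(n)
Food    <- 20 + 2 * quality + rnorm(n)
Decor   <- 18 + 2 * quality + rnorm(n)
Service <- 19 + 2 * quality + rnorm(n)

# Variance inflation factor for Service: how much of Service do the
# other predictors already explain?
r2  <- summary(lm(Service ~ Food + Decor))$r.squared
vif <- 1 / (1 - r2)
vif  # well above 1, so Service's standard error grows accordingly
```

(The car package's vif() computes the same quantity directly from a fitted model.)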
Given that Food and Decor are in the model, Service is not significant, and we cannot know whether it has an effect on modeling Price.