6 Multiple Linear Regression

6.1 Nitrate concentration

In order to analyze the effect of reducing nitrate loading in a Danish fjord, it was decided to formulate a linear model that describes the nitrate concentration in the fjord as a function of nitrate loading, it was further decided to correct for fresh water runoff. The resulting model was $$ Y_i = \beta_0+\beta_1x_{1,i}+\beta_2x_{2,i}+ \varepsilon_i , \qquad \varepsilon_i\sim N(0,\sigma^2) $$

where $Y_i$ is the natural logarithm of nitrate concentration, $x_{1,i}$ is the natural logarithm of nitrate loading, and $x_{2,i}$ is the natural logarithm of fresh water run off.

a)

Which of the following statements are assumed fulfilled in the usual multiple linear regression model?

$\varepsilon_i = 0$ for all $i = 1, ..., n,$ and $\beta_j$ follows a normal distribution
$E[x_1] = E[x_2] = 0$ and $V[\varepsilon_i] = \beta^2_1$
$E[\varepsilon_i] = 0$ and $V[\varepsilon_i] = \beta^2_1$ .
$\varepsilon_i$ is normally distributed with constant variance, and $\varepsilon_i$ and $\varepsilon_j$ are independent for $i \ne j$ . Yes
$\varepsilon_i = 0$ for all $i = 1, ..., n,$ and $x_j$ follows a normal distribution for $j = \{1, 2\}$

Answer: only 4. is correct, because this is the usual assumption about the errors

b)

The parameters in the model were estimated in R and the following results are available (slightly modified output from summary):

> summary(lm(y ~ x1 + x2))
Call:
lm(formula = y ~ x1 + x2)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.36500 0.22184    -10.661 < 2e-16
x1           0.47621 0.06169    7.720   3.25e-13
x2           0.08269 0.06977    1.185   0.237
---
Residual standard error: 0.3064 on 237 degrees of freedom
Multiple R-squared: 0.3438,Adjusted R-squared: 0.3382
F-statistic: 62.07 on 2 and 237 DF, p-value: < 2.2e-16

What are the parameter estimates for the model parameters ( $\hat\beta_i$ and $\hat\sigma^2$ ) and how many observations are included in the estimation?

$\hat\beta_0 = -2.365 \\ \hat\beta_1 = 0.47621 \\ \hat\beta_2 = 0.08269 \\ \hat\sigma^2 = 0.3064^2$

And there are 240 observations included in the estimation, because the number of degrees of freedom (237) is equal $n-(p+1)$ , where $p=2$ (because of $\hat\beta_2$ )

c)

Calculate the usual 95% confidence intervals for the parameters ( $\beta_0$ , $\beta_1$ , and $\beta_2$ )

By method 6.5 confidence intervals for the parameters are given by:

$\hat\beta_i\pm t_{1-\alpha/2}\hat\sigma_{\beta_i}$

So:

$\hat\beta_0\in [-2.80203, -1.92797] \\ \hat\beta_1 \in [0.3546792, 0.5977408] \\ \hat \beta_2 \in [-0.05475858, 0.22013858]$

# The example of Beta0 calculation in R
alpha <- 0.05
beta <- -2.36500
sigmabeta <- 0.22184
c( beta - qt((1-alpha/2), df=237)*sigmabeta,
beta + qt((1-alpha/2), df=237)*sigmabeta )
    -2.80203 -1.92797

d)

On level $\alpha= 0.05$ which of the parameters are significantly different from 0, also find the p-values for the tests used for each of the parameters?

By using the summary in the R output we can conclude that only $\beta_0$ and $\beta_1$ are significantly different from 0, because these parameters have very low $\text{p-value}$ (very strong evidence against the null hypothesis in both cases)

6.2 Multiple linear regression model

The following measurements have been obtained in a study:

D <- data.frame(
x1=c(0.58, 0.86, 0.29, 0.20, 0.56, 0.28, 0.08, 0.41, 0.22, 0.35,
0.59, 0.22, 0.26, 0.12, 0.65, 0.70, 0.30, 0.70, 0.39, 0.72,
0.45, 0.81, 0.04, 0.20, 0.95),
x2=c(0.71, 0.13, 0.79, 0.20, 0.56, 0.92, 0.01, 0.60, 0.70, 0.73,
0.13, 0.96, 0.27, 0.21, 0.88, 0.30, 0.15, 0.09, 0.17, 0.25,
0.30, 0.32, 0.82, 0.98, 0.00),
y=c(1.45, 1.93, 0.81, 0.61, 1.55, 0.95, 0.45, 1.14, 0.74, 0.98,
1.41, 0.81, 0.89, 0.68, 1.39, 1.53, 0.91, 1.49, 1.38, 1.73,
1.11, 1.68, 0.66, 0.69, 1.98)
)

It is expected that the response variable $y$ can be described by the independent variables $x_1$ and $x_2$ . This imply that the parameters of the following model should be estimated and tested

$Y_i = \beta_0+\beta_1x_{1}+\beta_2x_{2}+ \varepsilon_i \ , \qquad \varepsilon_i\sim N(0,\sigma^2)$

a)

Calculate the parameter estimates ( $\beta_0$ , $\beta_1$ , $\beta_2$ , and $\sigma^2$ ), in addition find the usual 95% confidence intervals for $\beta_0$ , $\beta_1$ and $\beta_2$ .

> fit <-lm(y ~ x1 + x2, data = D) 
> summary(fit)
Call:
lm(formula = y ~ x1 + x2, data = D)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.15493 -0.07801 -0.02004  0.04999  0.30112 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.433547   0.065983   6.571 1.31e-06 ***
x1          1.652993   0.095245  17.355 2.53e-14 ***
x2          0.003945   0.074854   0.053    0.958    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1127 on 22 degrees of freedom
Multiple R-squared:  0.9399,    Adjusted R-squared:  0.9344 
F-statistic:   172 on 2 and 22 DF,  p-value: 3.699e-14

$\hat\beta_0 = 0.433547 \\ \hat\beta_1 = 1.652993 \\ \hat\beta_2 = 0.003945 \\ \hat\sigma^2 = 0.1127^2 \\$

> confint(fit, level=0.95)
                 2.5 %    97.5 %
(Intercept)  0.2967067 0.5703875
x1           1.4554666 1.8505203
x2          -0.1512924 0.1591822

$\hat\beta_0 \in [0.2967067, 0.570387] \\ \hat\beta_1 \in [1.4554666, 1.8505203] \\ \hat\beta_2 \in [-0.1512924, 0.1591822] \\$

b)

Still using confidence level $\alpha= 0.05$ reduce the model if appropriate

Since the confidence interval for $\beta_2$ cover zero and the $\text{p-value}$ is much larger than $0.05$ , the parameter should be removed from the model to get the simpler model:

$Y_i = \beta_0+\beta_1x_{1}+ \varepsilon_i \ , \qquad \varepsilon_i\sim N(0,\sigma^2)$

The parameter estimates in the simpler model are:

> fit <- lm(y ~ x1, data =D)
> summary(fit)
Call:
lm(formula = y ~ x1, data = D)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.15633 -0.07633 -0.02145  0.05157  0.29994 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.43609    0.04399   9.913 9.02e-10 ***
x1           1.65121    0.08707  18.963 1.54e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1102 on 23 degrees of freedom
Multiple R-squared:  0.9399,    Adjusted R-squared:  0.9373 
F-statistic: 359.6 on 1 and 23 DF,  p-value: 1.538e-15

And both parameters are now significant.

c)

Carry out a residual analysis to check that the model assumptions are fulfilled.

It seems that there are no strong evidence against the assumptions, the qq-plot is are a straight line and the are no obvious dependence between the residuals and the fitted values, and we conclude that the assumptions are fulfilled.

par(mfrow=c(1,2))
qqnorm(fit$residuals)
qqline(fit$residuals)
plot(fit$fitted.values, fit$residuals, main="Fitted vs residuals")

## Walley
library(MESS)

qqwrap <- function(x, y, ...){
  stdy <- (y-mean(y))/sd(y)
  qqnorm(stdy, main="", ...)
  qqline(stdy)}


wallyplot(fit$residuals, FUN=qqwrap, ylim=c(-3,3))

d)

Make a plot of the fitted line and 95% confidence and prediction intervals of the line for $x_1 \in [0, 1]$ (it is assumed that the model was reduced above).

par(mfrow=c(1,1))
x1new <- seq(0,1,by=0.01)
pred <- predict(fit, newdata=data.frame(x1=x1new),
                interval="prediction")
conf <- predict(fit, newdata=data.frame(x1=x1new),
                interval="confidence")
plot(x1new, pred[ ,"fit"], type="l", ylim=c(0.1,2.4),
     xlab="x1", ylab="Prediction")
lines(x1new, conf[ ,"lwr"], col="green", lty=2)
lines(x1new, conf[ ,"upr"], col="green", lty=2)
lines(x1new, pred[ ,"lwr"], col="red", lty=2)
lines(x1new, pred[ ,"upr"], col="red", lty=2)
legend("topleft", c("Prediction","Confidence band","Prediction band"),
       lty=c(1,2,2), col=c(1,3,2), cex=0.7)