e-Statistics

Model Selection and F-test

The multiple linear regression model

$\displaystyle Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i,
\quad i=1,\ldots,n,
$

must be built by identifying specific variables in the data with (i) the response variable $ Y_i$ and (ii) the predictors $ x_{i1}$ through $x_{ik}$.

  1. To start over, clear the model formula first.
  2. From the data above, the column for the response variable $ Y_i$ (dependent variable) must be selected.
  3. It builds a model formula for the predictors $ x_{i1}$ up to $x_{ik}$ (independent variables), in which we set the predictor variables one by one for the model. A nonlinear transformation (e.g., log(x) or x^2) of the predictor x can be indicated by placing it in I(), for example I(log(x)) or I(x^2); a sketch is given after this list.
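
A minimal sketch in R of building such a model formula, assuming a hypothetical data frame dat with response column y and predictor columns x1, x2, x3 (these names are illustrative only):

    # hypothetical data frame 'dat': response y, predictors x1, x2, x3
    fit <- lm(y ~ x1 + x2 + I(log(x3)), data = dat)  # I() marks a transformed predictor
    summary(fit)                                      # estimated coefficients and model summary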

A scatterplot for each pair of variables can be produced in matrix form for the response $ Y_i$ and the explanatory variables $ x_{i1},\ldots,x_{ik}$. Collinearity appears in such a matrix as a nearly linear relation between a pair of the explanatory variables.
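
A minimal sketch in R of producing such a scatterplot matrix, again assuming the hypothetical data frame dat:

    # matrix of pairwise scatterplots for the response and the explanatory variables
    pairs(dat)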

The objective of the F-test is to determine whether the variable $x_{i,k}$ in the full model $H_1$ has "some effect" in comparison with the sub model $ H_0$ obtained by "dropping" $x_{i,k}$. The hypothesis testing problem then becomes

$\displaystyle H_0: Y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_{k-1} x_{i,k-1} + \epsilon_i
$

versus

$\displaystyle H_1: Y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_{k-1} x_{i,k-1} + \beta_{k} x_{i,k} + \epsilon_i
$

which represent the "sub" and the "full" model, respectively.
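
In R this comparison amounts to fitting two nested models; a sketch under the same hypothetical data frame dat, with x3 playing the role of the k-th predictor:

    fit0 <- lm(y ~ x1 + x2, data = dat)        # sub model H0, dropping the k-th predictor
    fit1 <- lm(y ~ x1 + x2 + x3, data = dat)   # full model H1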

To carry out the hypothesis test, the residual sum of squares under each model is formulated as

$\displaystyle SS_{m} = \sum_{i=1}^n (Y_{i} - \hat{y}_{i,m})^2
$

where the $\hat{y}_{i,m}$'s are the fitted values under model $m = 0$ or $1$, with residual degrees of freedom $df_{0} = n - k$ and $df_{1} = n - k - 1$, respectively. Under the sub model $ H_0$, the test statistic

$\displaystyle F = \frac{MS}{MS_{1}}
= \frac{(SS_{0}-SS_{1})/(df_{0}-df_{1})}{SS_{1}/df_{1}}
$

has the $ F$-distribution with $(df_{0}-df_{1}, df_{1})$ degrees of freedom. Thus, we reject $ H_0$ at significance level $ \alpha$ if $F > F_{\alpha,df_{0}-df_{1},df_{1}}$. Equivalently, we can compute the $ p$-value $ p^*$ and reject $ H_0$ if $ p^* < \alpha$.
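
A sketch of computing the F statistic and its p-value in R from the hypothetical fits fit0 and fit1 above; the built-in comparison anova(fit0, fit1) reports the same test:

    SS0 <- sum(residuals(fit0)^2); df0 <- df.residual(fit0)   # sub model H0
    SS1 <- sum(residuals(fit1)^2); df1 <- df.residual(fit1)   # full model H1
    Fstat <- ((SS0 - SS1) / (df0 - df1)) / (SS1 / df1)        # test statistic F
    pstar <- pf(Fstat, df0 - df1, df1, lower.tail = FALSE)    # p-value p*
    anova(fit0, fit1)                                         # equivalent built-in F-test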

| Drop | Source | Degrees of freedom | Sum of squares | Mean square | F-statistic |
|---|---|---|---|---|---|
| <none> | $H_1$ | $df_{1}$ | $SS_{1}$ | $MS_{1}=\frac{SS_{1}}{df_{1}}$ | |
| k-th column | Between $ H_0$ and $H_1$ | $df_{0}-df_{1}$ | $SS_{0}-SS_{1}$ | $MS = \frac{SS_{0}-SS_{1}}{df_{0}-df_{1}}$ | $F = \frac{MS}{MS_{1}}$ |

The table above summarizes the analysis of variance (aov) table for model selection. Since a sub model $ H_0$ is deemed plausible when we fail to reject $ H_0$, we look for the highest p-value $ p^*$ among the candidate variables and drop the corresponding variable if $p^* > \alpha$.
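
In R, a table of this form with one row per candidate predictor can be produced from the full fit; a sketch using the hypothetical fit1:

    drop1(fit1, test = "F")   # F-test for dropping each predictor in turn from the full model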


© TTU Mathematics