e-Statistics

Model Selection and F-test

The multiple linear regression model

$\displaystyle Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i,
\quad i=1,\ldots,n,
$

must be built by identifying specific variables in the data with (i) the response variable $ Y_i$ and (ii) the predictors $ x_{i1}$ through $x_{ik}$.

  1. To start over, clear the model formula first.
  2. From the data above, the column for the response variable $ Y_i$ (dependent variable) must be selected.
  3. It builds a model formula for the predictors $ x_{i1}$ up to $x_{ik}$ (independent variables), in which we set the predictor variables one by one for the model. A nonlinear transformation (e.g., log(x) or x^2) of the predictor x can be indicated by placing it in I(), for example I(log(x)) or I(x^2); a sketch is given after this list.
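
A minimal sketch in R of building such a model formula, assuming a hypothetical data frame dat with response column y and predictor columns x1, x2, x3 (these names are illustrative only):

    # hypothetical data frame 'dat': response y, predictors x1, x2, x3
    fit <- lm(y ~ x1 + x2 + I(log(x3)), data = dat)  # I() marks a transformed predictor
    summary(fit)                                      # estimated coefficients and model summary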

A scatterplot for each pair of variables can be produced in matrix form for the response $ Y_i$ and the explanatory variables $ x_{i1},\ldots,x_{ik}$. Collinearity appears in such a matrix as a nearly linear relation between a pair of the explanatory variables.
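
A minimal sketch in R of producing such a scatterplot matrix, again assuming the hypothetical data frame dat:

    # matrix of pairwise scatterplots for the response and the explanatory variables
    pairs(dat)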

The objective of the F-test is to determine whether the variable $x_{i,k}$ in the full model $H_1$ has "some effect" in comparison with the sub model $ H_0$ obtained by "dropping" $x_{i,k}$. The hypothesis testing problem then becomes

$\displaystyle H_0: Y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_{k-1} x_{i,k-1} + \epsilon_i
$

versus

$\displaystyle H_1: Y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_{k-1} x_{i,k-1} + \beta_{k} x_{i,k} + \epsilon_i
$

which represent the "sub" and the "full" model, respectively.
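
In R this comparison amounts to fitting two nested models; a sketch under the same hypothetical data frame dat, with x3 playing the role of the k-th predictor:

    fit0 <- lm(y ~ x1 + x2, data = dat)        # sub model H0, dropping the k-th predictor
    fit1 <- lm(y ~ x1 + x2 + x3, data = dat)   # full model H1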

To carry out the hypothesis test, the residual sum of squares under each model is formulated as

$\displaystyle SS_{m} = \sum_{i=1}^n (Y_{i} - \hat{y}_{i,m})^2
$

where the $\hat{y}_{i,m}$'s are the fitted values under model $m = 0$ or $1$, with residual degrees of freedom $df_{0} = n - k$ and $df_{1} = n - k - 1$, respectively. Under the sub model $ H_0$, the test statistic

$\displaystyle F = \frac{MS}{MS_{1}}
= \frac{(SS_{0}-SS_{1})/(df_{0}-df_{1})}{SS_{1}/df_{1}}
$

has the $ F$-distribution with $(df_{0}-df_{1}, df_{1})$ degrees of freedom. Thus, we reject $ H_0$ at significance level $ \alpha$ if $F > F_{\alpha,df_{0}-df_{1},df_{1}}$. Equivalently, we can compute the $ p$-value $ p^*$ and reject $ H_0$ if $ p^* < \alpha$.
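
A sketch of computing the F statistic and its p-value in R from the hypothetical fits fit0 and fit1 above; the built-in comparison anova(fit0, fit1) reports the same test:

    SS0 <- sum(residuals(fit0)^2); df0 <- df.residual(fit0)   # sub model H0
    SS1 <- sum(residuals(fit1)^2); df1 <- df.residual(fit1)   # full model H1
    Fstat <- ((SS0 - SS1) / (df0 - df1)) / (SS1 / df1)        # test statistic F
    pstar <- pf(Fstat, df0 - df1, df1, lower.tail = FALSE)    # p-value p*
    anova(fit0, fit1)                                         # equivalent built-in F-test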

| Drop | Source | Degrees of freedom | Sum of squares | Mean square | F-statistic |
|---|---|---|---|---|---|
| <none> | $H_1$ | $df_{1}$ | $SS_{1}$ | $MS_{1}=\frac{SS_{1}}{df_{1}}$ | |
| k-th column | Between $ H_0$ and $H_1$ | $df_{0}-df_{1}$ | $SS_{0}-SS_{1}$ | $MS = \frac{SS_{0}-SS_{1}}{df_{0}-df_{1}}$ | $F = \frac{MS}{MS_{1}}$ |

The table above summarizes the analysis of variance (aov) table for model selection. Since a sub model $ H_0$ is deemed plausible when we fail to reject $ H_0$, we look for the highest p-value $ p^*$ among the candidate variables and drop the corresponding variable if $p^* > \alpha$.
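
In R, a table of this form with one row per candidate predictor can be produced from the full fit; a sketch using the hypothetical fit1:

    drop1(fit1, test = "F")   # F-test for dropping each predictor in turn from the full model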


© TTU Mathematics