e-Statistics

Test for Homogeneity

In a study involving two characteristics, say “A” and “B,” researchers often want to know whether the two are linked or independent. For such a study we have $n$ paired categorical observations, which are summarized in a contingency table.

The first column of the contingency table lists the categorical values (or levels) of characteristic “A”. The remaining columns correspond to the categorical values (or responses) of characteristic “B” and contain the cell frequencies $X_{A,B}$. The contingency table can be visualized by the mosaic plot below. The area of each tile in the mosaic plot is proportional to the number $X_{A,B}$ of observations with the given response of B within the given level of A. Thus, homogeneity is indicated by tiles of similar size across the different levels of A.
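As a rough illustration of how paired observations turn into such a table, the sketch below tallies the cell frequencies $X_{A,B}$ with Python's standard library. The data, the level names (`a1`, `a2`) and the responses (`yes`, `no`) are made up for the example and do not come from any study.

```python
from collections import Counter

# Hypothetical paired observations (level of A, response of B).
pairs = [
    ("a1", "yes"), ("a1", "no"), ("a1", "yes"), ("a1", "yes"),
    ("a2", "no"), ("a2", "no"), ("a2", "yes"),
]

# Count how often each (A, B) combination occurs.
counts = Counter(pairs)

# Fix an ordering of the levels of A (rows) and the responses of B (columns).
levels_a = sorted({a for a, _ in pairs})
levels_b = sorted({b for _, b in pairs})

# table[i][j] is the cell frequency for levels_a[i] and levels_b[j];
# combinations that never occur get a count of 0.
table = [[counts[(a, b)] for b in levels_b] for a in levels_a]

# Print the table with the levels of A in the first column.
print("A", *levels_b, sep="\t")
for a, row in zip(levels_a, table):
    print(a, *row, sep="\t")
```

The cell counts add up to the number $n$ of paired observations, which is a quick sanity check on any table built this way.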

The null hypothesis states that “the two characteristics are independent.” Let $n_{A,\cdot}$ and $n_{\cdot,B}$ denote the total counts for the respective values of A and B (i.e., the row and column sums of the contingency table). Under the null hypothesis, the expected frequencies for the contingency table are given by $E_{A,B} = (n_{A,\cdot}\times n_{\cdot,B}) / n$, where $n$ denotes the total of all cell counts. Then the chi-square statistic is

$\chi^2 = \displaystyle\sum_{A} \sum_{B} \frac{(X_{A,B} - E_{A,B})^2}{E_{A,B}}$
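The expected frequencies and the statistic can be computed directly from the table. The following is a minimal sketch in plain Python. The function names and the 2-by-3 table of counts are illustrative assumptions, and the code assumes every row and column total is positive so that no expected frequency is zero.

```python
def expected_frequencies(table):
    """Return E[A][B] = (row total of A) * (column total of B) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Sum (X - E)^2 / E over all cells of the contingency table."""
    expected = expected_frequencies(table)
    return sum(
        (x - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for x, e in zip(obs_row, exp_row)
    )


# Hypothetical counts: rows are levels of A, columns are responses of B.
observed = [
    [20, 30, 50],
    [30, 30, 40],
]

expected = expected_frequencies(observed)
statistic = chi_square_statistic(observed)

print(expected)             # [[25.0, 30.0, 45.0], [25.0, 30.0, 45.0]]
print(round(statistic, 4))  # 3.1111
```

In this example both rows have the same total, so the expected frequencies are identical across the two levels of A, which is what homogeneity would look like in the counts themselves.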

Let $r$ and $c$ denote the number of categorical values in the rows and the columns, respectively. Then we compare the statistic with the chi-square distribution with $df = (r-1)(c-1)$ degrees of freedom, and construct the critical region $x > \chi^2_{\alpha,df}$, where $x$ is the observed value of the statistic, to determine whether the null hypothesis can be rejected. Equivalently, we reject the null hypothesis (that is, we find evidence of dependence, or association, between the two characteristics) if the p-value is significant (that is, $p^* < \alpha$).
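The decision rule can be sketched in code as well. The example below uses the p-value form of the rule, with the usual $(r-1)(c-1)$ degrees of freedom for a table with $r$ rows and $c$ columns. To stay within the standard library it evaluates the chi-square upper-tail probability through the recurrence $Q(x;\nu+2) = Q(x;\nu) + (x/2)^{\nu/2} e^{-x/2} / \Gamma(\nu/2+1)$, which holds for integer degrees of freedom. A statistics package would normally supply this function. The names `chi2_survival` and `independence_test`, the input statistic and the level $\alpha = 0.05$ are assumptions made for illustration.

```python
import math


def chi2_survival(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        v = 2
        q = math.exp(-half)             # closed form for df = 2
    else:
        v = 1
        q = math.erfc(math.sqrt(half))  # closed form for df = 1
    # Step up two degrees of freedom at a time until df is reached.
    while v < df:
        q += half ** (v / 2.0) * math.exp(-half) / math.gamma(v / 2.0 + 1.0)
        v += 2
    return q


def independence_test(statistic, n_rows, n_cols, alpha=0.05):
    """Return (df, p-value, reject?) for the chi-square test of independence."""
    df = (n_rows - 1) * (n_cols - 1)
    p_value = chi2_survival(statistic, df)
    return df, p_value, p_value < alpha


# Hypothetical input: a statistic of 28/9 obtained from a 2-by-3 table.
df, p_value, reject = independence_test(28 / 9, n_rows=2, n_cols=3)
print(df, round(p_value, 3), reject)
```

With these inputs there are 2 degrees of freedom and the p-value is about 0.211, so the null hypothesis of independence is not rejected at the 5% level. Checking $x > \chi^2_{\alpha,df}$ against a tabulated critical value leads to the same decision.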